Method and apparatus for textual exploration discovery

ABSTRACT

The present invention relates to a method and system for textual exploration and discovery. More specifically, the method and system provide a text-driven and grammar based tool for textual exploration and textual navigation. The facilities for textual exploration and textual navigation are based on a system of index entries that are connected to the underlying text segments from which the index entries are derived. Text units with particular grammatical, semantic, and/or pragmatic features constitute bundles of sentences or text zones.

This is a §371 of PCT/NO02/00423 filed Nov. 15, 2002, which claims priority from Norwegian 20015581 filed Nov. 15, 2001, each of which are hereby incorporated by reference in their entireties.

FIELD OF THE INVENTION

The present invention relates to a method and system for textual exploration and discovery. More specifically, the method and system provide a text-driven and grammar based tool for textual exploration and textual navigation. The facilities for textual exploration and textual navigation are based on a system of index entries that are connected to the underlying text units from which the index entries are derived. Text units with particular grammatical, semantic, and/or pragmatic features constitute bundles of sentences or text zones.

Index structures constitute a system of representations of texts extracted from preferably domain specific document collections. Indexes stand in place of the original text of interest to users and constitute the system's selectivity. The present invention focuses on the rules applied for the construction of representations and the process for making use of the representations during text exploration and text navigation. The system of representations is made available in a preferred embodiment of an apparatus supporting text exploration activities.

The present invention focuses on presenting the representations as attention structures making the user aware of the texts' content. The preferred embodiment of the present invention supports users when they formulate their requests and provides for flexible tools that guide the users' attention to portions of the underlying texts with options for further in-depth exploration.

BACKGROUND OF THE INVENTION

The brief description of the problem area is restricted to the two-sided problem facing the user instructed to investigate documents. In this situational context she knows from experience that some particular parts of the documents are more noteworthy for the interpretative task she is engaged in, and that other parts may be considered as more superfluous. Further, she finds it difficult to rapidly locate the important parts in a context characterised by time pressure.

The present invention is founded on the assumption that improved indexing engines, search engines, and other tools that are developed as a response to search-related problems are not adequate for the performance of interpretative tasks involving in-depth investigations of texts. These tools are primarily seen as addressing another type of problem, commonly denoted as ‘information overflow’, where the goal is to detect a possibly useful subset of documents from WWW-accessible collections comprising millions of documents.

Search systems of the prior art are, by and large, based on the so-called ‘traditional model of information retrieval’. This model is thoroughly characterised and discussed in information retrieval literature. A quote extracted from Blair (1990) indicates the principal features of the problem in focus: “ . . . the traditional model of information retrieval which stipulates that the indexer's (or automatic indexing procedure's) job is to accurately describe the content and context of documents, regardless of how the inquirers might describe that content, and the inquirer's task is to guess how the documents he might find useful have been represented. This is the normative model of information retrieval and it is implicit in most information retrieval models.” (1990:189).

The provision of the right information and the saving of time put pressure on better acquisition procedures, and as the amount of information that is available is steadily growing, the burden on indexing devices also becomes higher. This is commonly denoted as the ‘scaling problem’.

The present invention looks beyond information dissemination as merely making more information available. The present invention presupposes intermediaries in the organisation (user community) who gather documents from various sources. Acquisition, segmentation, disambiguation and the underlying indexing principles are of critical importance for effective dissemination, searching and use of document collections. The answer to the specific problems addressed is not found in smarter search algorithms or so-called intelligent agents per se, although new functionality and new visualisation techniques might help. The answer embedded in the present invention is to get the user closer to the content by new representational means and a new set of tools interfacing these.

The challenge is to transform the essential documents into a system that differentiates between document types and constructs representations of texts extracted from documents in a manner that attracts the users' attention. The textual content has to be transformed and reduced into a form that makes the content accessible with less effort and time expenditure. Specially designed services will add value to the content representations by applying a particular apparatus for zonation and an apparatus for filtering that delivers results in an interface, preferably denoted as a text sounding board.

The Principle of Text Driven Attention Structures

In order to explain this principle, it is necessary to include a brief reflection on the concepts of ‘meaning’, ‘understanding’, and ‘context’, with reference to characteristics of text. Preferably this reflection relates to genres such as argumentative texts, directive texts and narratives.

First of all, the comprehension of, and therefore the definition of, the concepts ‘meaning’ and ‘content’ is dependent on that of ‘context’. The environment of the present invention comprises two substantially different parts: 1) Authors who are situated in a situational context and, for some reason, produce documents that reflect some of the features of the situational context as perceived by the authors. 2) Users who are situated in another situational context and, for some reason, have to confront documents in order to read and interpret the ‘meanings of the author’ mediated in text.

Authors who are situated in particular situational contexts produce documents, and other actors who are situated in quite different situational contexts use documents. Even if the author and user happen to be the same person, the situation at the time of writing will be different from a later situation of exploring and reading. The user may perceive/understand one ‘meaning’ from the text's content in one situational context, but seek another ‘meaning’ from the same content in another situational context. The situational context is for instance influenced by the work task at hand, time available, background knowledge, etc.

‘Meaning’ and ‘context’ appear in varying situations, but are still mutually related and relative. To situate some words or word constellations within the text's inner context, i.e., within the context of the text itself, will lead the user's attention to a certain place (‘locus’). However, for the user to understand the visualised place as meaningful, she also has to understand it as meaningful in the situational context in which she operates, i.e., why she finds it necessary to explore and read documents.

The concept of ‘meaning in context’ cannot be defined properly since it denotes a kind of circularity of enclosure. The interrelationships between ‘meaning’ and ‘context’ can be expressed if ‘context’ is seen as levels of enclosure. Thus the words' inner context is the words in the surrounding area, preferably with the document from which the text is extracted seen as the edges of the inner context. A particular text also has an outer context, also textual, as defined by other documents in some way related to the situational context in which they were produced. The situational context is the world ‘outside the text’, and each text always reflects, more or less, one or several authors' interpretation of this ‘outside world’.

A user situated in a different situational context can thus be made aware of some features related to the author's interpretation mediated in text. A text driven generation of text zones reflects some of these features as related to how the author's focus of attention moves and shifts across the text's collection of sentences. Thus text zones provide for artificial horizontal sub-contexts, i.e., horizontal in that sentences follow each other in sequences, at least within the cultural environment of the present invention.

The text zones reflect particular patterns of repetition which, when taken together with words not particularly repeated within a zone, build up structures of attention originating from the author. The patterns of repetition encompass several textual features at different levels, i.e. not only lexical features but also features related to grammatical form, such as tense and modality, and superordinate argumentative functions signalling particular discourse elements.

A text zone is an artificial or derived horizontal sub-context (within the inner context), giving the background information for a particular word occurrence or word constellation. This background information affects the ‘meaning’ of the word occurrence as determined by the author. Likewise, the background information affects the ‘meaning’ of the word occurrence as understood by a user in a totally different situational context. The background information can be as significant as the particular word occurrence when the user decides whether the ‘meaning’ or ‘content’ is useful in that particular situational context.

Consequently, the notion of ‘equality’ between words, either the very same word or its synonyms or near-synonyms, is by definition an ambiguous concept. Equality or sameness refers to word occurrences that, according to some schools of thought, purport to refer to the ‘same entity’ in the situational context, i.e. an entity supposed to exist in the world outside the text. The present invention is based on the assumption that even the ‘same entity’ will be perceived differently and that this perception again varies with context. This ends up with an assertion that there are no criteria for determining equality between the very same words occurring in varying contexts. It will therefore not be possible to construct a description for a word and its interconnections to other words that is detached from context, and thereafter apply the description for the same word occurring in various contexts. Since the identification of text zones is dependent on the identification of word occurrences and how they repeat in patterns of fluctuation, the users' recognition and understanding of the word occurrences will be dependent on the text zone in which the word occurs, i.e., the word occurrences' situated background information.

This brief reflection explains why the present invention does not rely on, or is cautious about, the application of general thesauri (such as the widely used WordNet) or semantic networks, for instance those conforming to the syntax defined for Topic Maps (ISO/IEC 13250).

The present invention instead relies on a method and system for establishing relations between word occurrences with respect to the words' inner context. This explains the principle of text drivenness, in which the text itself gives the necessary background information for the generation and construction of relations between words, where bundles of relations form text zones reflecting how the author's attention moves across the text. When these text zones and particular word occurrences within zones are visualised in a preferred interface, the users' attention will be directed towards these structures seen as horizons virtually superimposed on underlying grammatically encoded texts. The phrase ‘virtually superimposed’ refers to the fact that the structures are not encoded in the text; rather, they are managed in a system of external files and a device that transmits derived information and displays it in a text sounding board. By operating on this text sounding board, the user can directly influence the device that constructs attention structures reflecting the users' explorative moves.

The key concept is that of text driven attention structures reflecting aspects of the authors' attention in their perceived situational context at the time of writing. The concepts of insight, chance and discovery cover the user's side, reflecting the knowledgeable user confronted with the texts made accessible via a text sounding board, where the user operates in a totally different situational context. (The concepts of insight, chance, and discovery are borrowed from the ancient legend about the Three Princes of Serendip, as told in Remer (1965).)

The Users' Problem

The user's main problem is to express her ‘information need’. The problem for the user is related to the indexing devices (in a continuum from controlled to free-text indexing), and not so much related to the system's search functionality. (The concept search functionality refers to the implementation of how the system matches the user request against representations of documents in the system and how the system calculates/presents the items most likely to satisfy the user's need).

The main problem for the user is related to the user's ability to express her ‘information need’ as a request submitted to the search system. The search request is a search expression composed of a set of search terms and search operators. The search expressions are indirect in that the searches are not executed in the text itself, but in index structures that are supposed to represent the text content (text content surrogates). The search system compares the constellation of terms in the search expression with the system's index terms (document representations or document vectors).
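The indirectness described above can be illustrated with a minimal sketch. It assumes a toy inverted index and a Boolean AND operator; the names `build_index` and `search_and` and the sample documents are hypothetical and are not part of the invention or of any particular prior-art system.

```python
from collections import defaultdict


def build_index(documents):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token.strip(".,;:()")].add(doc_id)
    return index


def search_and(index, terms):
    """Return ids of documents indexed under every term (Boolean AND).

    The request is matched against the index (the text surrogate),
    never against the running text itself.
    """
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents, used only for illustration.
documents = {
    1: "The attorney filed the appeal.",
    2: "The lawyer filed an appeal on Monday.",
    3: "Paediatrics covers diseases of children.",
}
index = build_index(documents)

# The user must guess which terms the index actually holds.
hits_lawyer = search_and(index, ["lawyer", "appeal"])        # finds document 2 only
hits_solicitor = search_and(index, ["solicitor", "appeal"])  # finds nothing
```

In this sketch document 1 is about the same subject as document 2, yet neither request retrieves it, because the request terms and the index terms differ. The user receives no indication of why.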

The index terms in a search expression may be combined in a seemingly infinite number of ways, and the user will experience uncertainty as to whether documents are indexed with the terms included in the search expression. Surely, in all information searching, there is an investment of time. Advanced indexing devices aim at reducing search time by trimming the search space. However, the point made is that the user will meet the same type of problem regardless of whether the index structures contain so-called free-text terms or terms from a controlled vocabulary (index terms using the notation from a classification scheme, which in fact is an extreme form of summarisation). The index structures may be restricted to chains of nominal expressions, and concepts may be related by simple semantic links (synonyms, etc), arranged in hierarchical structures (broader terms and narrower terms). However, these relations are always much weaker than the original textual semantic relations that incorporate textual coherence.

The Search Process is a Linguistic Transformation Process

Empirical investigations reveal several factors explaining users' inability to express their information need in an accurate manner so that the system produces a result covering the information need (normally the discussions differentiate between goal-oriented searches and interest-oriented searches). The user is in a situation in which she has to balance two quite different goals: First of all, she has to predict how supposedly relevant parts of the text are represented in the index system. Secondly, she must formulate a request that retrieves a number of items (documents or text segments) that is adequate with respect to the amount of resources she has available when judging the items' usefulness.

When performing a goal-oriented search in a domain-specific, rather small-scale document base, the user needs a possibility to explore available index terms in order to deliver an accurate request to the system. A search result of, let us say, 100 to 1000 items (or more) is in some situations of no value to the user. The number of items in the result list exceeds the user's futility point, i.e. the user's capacity to browse/read in order to find information accepted as useful.

A lot of factors influence the user when she is trying to formulate a ‘best match query’ (background knowledge, data base heterogeneity, etc). This process is in fact a linguistic transformation process where the user has to transform her ideas about an information need into a chain of nominal expressions. On the other side, document content has been transformed in another process resulting in lists of isolated concepts.

An isolated term or concept is a word that cannot, in isolation, refer to the meaning mediated in the text (Ranganathan 1967). (An isolated concept can be a component in a compound subject, in turn being a part of a complex subject.) This assertion covers both indexes resulting from automatic indexing procedures and indexes resulting from so-called independent subject analysis. Semantic relations that occur in the text cannot be expressed in the index (as opposed to semantic relations encoded in for instance thesauri).

Why Does the User's Request Fail?

A search request may fail for a number of reasons (the request fails when the system delivers a result that the user finds unsatisfactory). The following list gives a simple overview of some important causes related to the use of terms (words, expressions) in the search requests.

Terms are left out (excluded), perhaps because the user assumes that they are not present in the system's index structure or that she assumes them to be of no relevance in a search request or that she believes that certain terms do not have a sufficient discriminating ability.

Terms are included because the user thinks that certain words are present in documents or represented in the index structures. Automatic procedures can remove such terms and/or replace them by classifying them as members of a semantic class in a thesaurus. Replacements may be in conflict with the user's intention or the idea the user is trying to express through a set of terms (however, systems supporting this option normally ask the user to confirm term replacements).

The user selects terms referring to words that are used at present (new or popular terms) or words related to a specific domain (professional language). Documents of potential relevance may be indexed with terms that are different from those used at present but referring to the same meaning. Thesauri inquiries may establish term accordance (terms in request and terms in index structure). This strategy, however, increases the search scope (involves the operator OR) and thereby the result list may exceed the user's futility point.

The request includes too many terms or terms combined with operators that exclude potentially relevant documents (text segments). Empirical investigations indicate that users are reluctant to alter or remove the first 2-3 terms in a combined list. Automatic procedures can adjust the sequence of terms and/or give terms weights according to their position in a list. If the user considers the first terms as more important than the others, these automatic procedures may conflict with the user's intention.

The request includes terms at an abstraction level different from the terms in the index structure. In more advanced systems the user is given the option to select broader or narrower terms. Alternatively, the user can choose operators that move downwards or upwards in a term hierarchy. Depending on the thesaurus, the search scope may accordingly be too large or too narrow with respect to the user's search intention.
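The effect of thesaurus-based term accordance named in the list above, where synonyms are combined with the operator OR, can be sketched as follows. This is an illustration under stated assumptions only: the index contents, the thesaurus entry, the `FUTILITY_POINT` value and the function names are all hypothetical.

```python
def search_or(index, terms):
    """Return ids of documents indexed under at least one term (Boolean OR)."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


def expand(term, thesaurus):
    """Return the term followed by its thesaurus synonyms, if any."""
    return [term] + sorted(thesaurus.get(term, set()))


# Hypothetical index: term -> ids of documents indexed under that term.
index = {
    "lawyer": {1, 2},
    "attorney": {3, 4, 5},
    "solicitor": {6},
}
# Hypothetical thesaurus entry relating the three lexical variants.
thesaurus = {"lawyer": {"attorney", "solicitor"}}

# Assumed number of items this user is willing to inspect.
FUTILITY_POINT = 4

narrow = search_or(index, ["lawyer"])
broad = search_or(index, expand("lawyer", thesaurus))
exceeds_futility_point = len(broad) > FUTILITY_POINT
```

The expanded request reaches documents that the original request missed, but the result list grows from two items to six and passes the assumed futility point.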

Several Failure Causes may be Present in One Request

The user's linguistic transformation problem is that several of these ‘failure causes’ may be present in one search request. The user has no possibility to evaluate her search request with respect to terms available in the index structures. The index terms are ‘hidden’ in that the user can only perceive fragments (if the system at all offers options for looking into the index system).

The problem bears some resemblance to a situation where two persons are trying to hold a dialogue while speaking two different languages (the user's natural language transformed into a chain of terms and the system's documents transformed into an index structure with isolated terms without relations). The user is in a situation where she tries to learn the system's language in order to achieve a goal (satisfying an information need). However, the learning of a new language presupposes feedback about why a certain expression does not produce a satisfactory search result. No system (yet) provides feedback explaining why the search request failed, and such feedback is complicated if several of the mentioned ‘failure causes’ are present in the same request. Since the user cannot inspect the system's language use, she will not be able to correct her own language use when formulating search requests. The only available strategy is to proceed tentatively (trial and error) in every new search situation (new tasks with new information requests).

Systems of the prior art embody various proposals aiming at constructing diagnostic devices analysing the user's requests as compared to the results the user evaluates and marks as relevant. Such diagnostic devices seem to have problems dealing with the fact that language use is a dynamic entity “whose times of greatest dynamism and change may come in the very process of interacting with a retrieval system” (Doyle 1963).

The Present Invention's Solution Proposal

As early as in 1963 Doyle considered the role of relevance in information retrieval testing and concluded: “The gradually increasing awareness of human's incapability of stating his true need in a simple form will tend to pull the rug out from under many information retrieval system evaluation studies which will have been done in the meanwhile.”

Doyle argued that the solution to this problem was not to design systems around the concept of relevance, but to base design on the concept of exploratory capability: “the searcher needs an efficient exploratory system rather than a request implementing system”.

With reference to this quote, the present invention therefore basically addresses the user's problem of formulating queries, and provides feedback about the extent to which the request matches the actual content in the documents/texts. A context-dependent and situated content representation takes into account the actual situation of the user. The assumption for the present invention is a domain-specific document collection evaluated as worth delivering to professionals within a certain user community.

Rather than relying on the user's capability of expressing information needs in an accurate manner, the system should provide the user with mechanisms that reflect the actual content in the document collection. The representation of document content must attend to the economy of time, and more costly techniques are justified in terms of offering the user advanced options for exploring text in order to discover text zones that are useful in a given situation. The percentage scores of current search engines are, in this context, an entirely inadequate measure of a system's value for the user. The present invention seeks to solve this problem by incorporating new text theories and language technology into the field of constructing a system's selectivity. The apparatuses for segmentation and disambiguation perform essential pre-processing of the texts in order for other apparatuses to construct the preferred selectivity embodied as attention structures supporting individual behaviour during text exploration and navigation.

The interconnected apparatuses as outlined in FIG. 1 provide for a new type of selectivity. The particular apparatus that visualises grammar based contacts to the texts prepared for investigation will be explained in more detail below. The interface that displays these contacts is preferably denoted ‘text sounding board’ and provides a kind of ‘decision support’ in that it exposes the texts' content to the user and she is free to select her own moves by operating the content of the text sounding board. Her moves and actions are immediately mirrored in the interconnected text pane as illustrated in FIG. 5.

The selectivity of the present invention incorporates and supports:

-   Grammatical information derived from CG-taggers
-   Semantic information and the transfer of techniques related to thesauri construction
-   Pragmatic information related to text understanding and features related to the situational context
-   Statistical information derived from applying a reference corpus and computing keyness, and keyness of keyness
-   Frequency information combined with grammatical information in relation to interconnected documental logical object types
-   Zonation and filtering realised as intersecting chains, which embody the various types of information outlined above

Search Engines do not Solve this Particular Problem

Despite all the work on search and indexing engines over the past 50 years, the problem of classifying, indexing and retrieving digital content remains a major problem for unstructured data such as text. Search and indexing engines (such as Lycos, Google, AltaVista, InfoSeek, etc) propose to solve the problem of finding information by constructing indexes from information sources available on the World Wide Web. Oversimplified, this is done by tracing hyperlinks and parsing the pages these hyperlinks point to. The URLs are maintained as entries in global index tables that these engines create, and the pages referenced by the URLs can be retrieved in reply to a search request. Information filters propose to solve the problem of information overload in that they synthesise previous user requests into categories that are regularly invoked to operate on information streams.

Traditional search systems rely on different indexing devices, and different indexing languages vary in the extent to which they use single or compound terms and hierarchies, and in whether index terms are controlled for synonyms or homographs. Free-text indexing devices are often combined with controlled vocabularies (assigned keywords). Users can normally restrict their search scope to certain fields (catalogue elements or Dublin Core Elements such as title, author, date of publication, headers, abstracts, and so on) and/or to certain document types. Typical search options are simple searches and category searches (index terms are arranged in controlled hierarchies). More sophisticated systems support GREP searches (regular expression searches), which control the matching process based on ‘special characters’ included in the search string, and various types of proximity operators. The employment of statistical and probabilistic techniques is a broadly accepted quantitative framework. However, limitations of the statistical approach are still claimed, with reference to various retrieval performance metrics: the performance of systems employing statistical techniques is still (in absolute terms) low.
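As a brief illustration of the two prior-art search options mentioned last, the following sketch uses Python's standard `re` module for a regular expression search and a simple token-distance test as a stand-in for a proximity operator. The function names, the `max_distance` parameter and the sample sentence are assumptions made for this example; actual systems define their own operator syntax.

```python
import re


def regex_search(pattern, text):
    """Return every substring of text matching the regular expression."""
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


def near(text, first, second, max_distance):
    """Proximity test: do the two words occur within max_distance tokens?"""
    tokens = re.findall(r"\w+", text.lower())
    pos_first = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    pos_second = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    return any(abs(i - j) <= max_distance for i in pos_first for j in pos_second)


# Hypothetical sample sentence, used only for illustration.
text = "Indexing devices combine free-text indexing with controlled vocabularies."

# 'Special characters' (\b, \w, *) control the matching process.
matches = regex_search(r"\bindex\w*", text)

close = near(text, "controlled", "vocabularies", 1)  # adjacent tokens
far = near(text, "devices", "vocabularies", 2)       # seven tokens apart
```

Both options match character strings or token positions only. Neither says anything about whether the matched passage is useful to the user in her situational context.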

The Indexing Problem

As mentioned, index structures constitute a system of representations. The concept of representation by definition means that some information is left out. In order to ensure that the loss is not crucial with respect to information searches, the indexing strategy should focus on which information is expendable and which is not. In the following, some principal issues are briefly described. Indexing and classifying (indexing here: using the notations of a classification scheme) appear as a special profession and are often seen as bound to retrieval necessities. Since indexing is bound to technical use in information retrieval, indexers (persons or programs) must strictly consider a set of representational prescriptions. The myriad of indexing strategies can be positioned according to combinations of a wide range of dimensions. Search engines operating on index structures to a varying degree include techniques for integrating (compare, weigh and merge) the index terms across databases. Representing textual content in compliance with prescriptions may explain the cause of several problems related to retrieval issues.

First of all, prescriptions set the requirement for the index terms, thus it can be the source of the ‘isolated’ descriptors assigned to the document at the cost of the textual formulations which may be the best discriminators in a given search situation.

Secondly, different indexing strategies result in different index terms for the very same textual content (extracted from a document), known as the inter-indexers' consistency problem, and the problem exists whether the indexing is performed by a human or a machine.

The tuning of index terms based on statistical information (word weighting procedures) may further obscure textual nuances that have a discriminating search effect. For instance, it is assumed that highly professional authors use a richer vocabulary than more inexperienced authors. Lexical style (influenced by personal, social, cultural, and other contextual factors) reflects the author's choices among immense variations in word constellations used to express more or less the same meaning. Words like ‘lawyer’, ‘attorney’, or ‘solicitor’ are variations in lexical style; however, the textual context may reveal deeper semantic variations. Such simple linguistic variations may be captured in indexing devices with synonymy relations derived from thesauri. The problem escalates when considering the fact that similar meanings may be expressed through sentences having different syntactic structure or word constellations that paraphrase single-word terms (‘diseases of children’ instead of ‘paediatrics’).

The issue about lexical style is related to another indexing problem. Selecting the ‘right’ words from a classification system or thesaurus can be quite complicated when indexing documents with an ‘unexpected’ or innovative content. New terms not covered in the classification system have to be projected into existing terms or the indexer has to extend the classification system so that it reflects the new terms. This latter case requires human intervention (independent subject analysis), and in principle also requires a professional indexer with lexicographic competence.

These and related problems explain the viewpoint taken by Langridge (1989): “At present the potential of computers is largely wasted because they are merely used as a medium for inferior indexing methods.” Blair goes even further and claims: “To see the information problem as a computer problem is to confuse physical access with logical access, or to confuse the tool with the job.” (1990:70). The concept ‘logical access’ in information retrieval refers to issues related to reducing the number of logical decisions the user must make when searching for information.

The focus is on how to identify and represent textual content in a text-driven fashion and provide representations visualised in a text sounding board. These representations are the logical access points for the set of texts of potential interest to users and constitute the system's selectivity. The present invention also includes a rich set of options giving the users the opportunity to conduct text exploration and text navigation based on the constellations of access points visualised in the text sounding board.

SUMMARY OF THE INVENTION

“TextSounder” is the preferred name of the present invention.

The main object of the present invention is to address the information processing requirements in information-intensive organisations. More specifically, it addresses organisations with ‘knowledge specialists’ in which documents are an interpretative medium for the enterprises' activities.

The present invention elevates the users' ‘insight’ into the chosen stance of search, offering numerous options for exploring logical access points organised in a set of contacts to the underlying text prepared for exploration. The present invention enables the user to transform his/her insight into textual discovery.

Specifically, the user is given the opportunity to explore such contacts to the text as displayed in an interface with the preferred name ‘text sounding board’ that includes a wide variety of facilities arranged in five different modus operandi. The users' moves and actions when operating the text sounding board are immediately reflected in the interconnected text pane. Preferably, a particular device denoted as the ‘triple track’ organises contacts in triplets in which the contacts give a glance into the word types' nearest inner textual context. The preferred current display in the text pane ‘follows’ the moves in these ‘rolling tracks’, in which a selection made in one of the tracks influences the display of contacts in two other interconnected tracks. The triplets of contacts underlying the triple track are founded on a wide range of criteria as elaborated in the section ‘Zonation Criteria’, subsumed in the section ‘Apparatus for Zonation’.
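
By way of illustration only, the interplay between the three ‘rolling tracks’ may be sketched as follows. The sketch assumes that each triplet is a Subject-Verb-Object contact tied to a sentence identifier; the names `Triplet` and `roll_tracks` are hypothetical and are not taken from the specification, and the actual zonation criteria are considerably richer than this exact-match filter.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One hypothetical contact triplet with a link back to its sentence."""
    subject: str
    verb: str
    obj: str
    sentence_id: int


def roll_tracks(triplets, subject=None, verb=None, obj=None):
    """Filter triplets by the selections made in any track and return
    the contacts that remain visible in each of the three tracks."""
    def keep(t):
        return ((subject is None or t.subject == subject)
                and (verb is None or t.verb == verb)
                and (obj is None or t.obj == obj))

    selected = [t for t in triplets if keep(t)]
    return {
        "subjects": sorted({t.subject for t in selected}),
        "verbs": sorted({t.verb for t in selected}),
        "objects": sorted({t.obj for t in selected}),
        # Sentence identifiers let the text pane follow the selection.
        "sentence_ids": [t.sentence_id for t in selected],
    }
```

Selecting a contact in one track thus narrows the contacts shown in the other two, while the returned sentence identifiers indicate which portions of the text pane would be brought into view.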

The selectivity embodied in the triple track requires pre-processing of the underlying texts, in particular pre-processing performed by grammar taggers known in the prior art. The section ‘Apparatus for Disambiguation’ briefly outlines the process and the present invention's adjustments of deliveries from grammar taggers. The prepared texts and all necessary information extracted and/or derived from the annotated texts are stored and managed in a specially designed database, managed by a DBMS known in the prior art.

Further embodiments of the invention include detailed design and construction of semantic relations between words as they appear in their inner context. The explanation for the rather cautious approach regarding the establishment of semantic relations is given in the section ‘The principle of text driven attention structures’ and in the section ‘Apparatus for Zonation’.

The contacts are inter alia mapped against preferably domain specific thesauri in a target word selection procedure that regulates the assignment of relations in accordance with how the words appear in their inner context. The target word selection procedure aims at strengthening the text zones within a text. Relations that are validated with reference to the texts' content are structured in an evolving thesaurus that provides more details about the contacts displayed in the text sounding board. They will preferably support the user in her text exploration tasks in that they only reflect features about word occurrences in the current text displayed in the text pane. The new thesauri structures are not superimposed on new texts before the word appearances in the text are checked in a new cycle of target word selection.

The users' request posed to the system, or rather the direction of a course of search (moves), will be influenced by patterns of contact collocations displayed in the triple track. If these contacts resemble ‘something’ that the user had intended to find or look for, the search can proceed as planned. If the interests and/or contacts diverge, the user may wish to alter her focus of the search. The present invention offers options for navigating up or down abstraction layers (rounds and levels within each track in the triple track).

What distinguishes the insightful user from the more casual one is the ability to see a pattern or implication when exposed to it. The present invention is designed for users prepared to recognise triplets containing signs reflecting the information sought for, and the invention thus presumes that triplets of contacts (or parts of them) will be recognised as significant when they occur in the windowpanes. The user may discover potentially worthwhile material when exposed to the texts' content and the patterns revealed in the text sounding board.

When a user has evaluated a set of contacts, she can ask for further refinements, which preferably will be embodied in the form of ‘unfolding panes’. Much of what users are exposed to will rapidly be discarded. The triple track may be considered as an embodiment of an epitomic approach, which guides the users' attention into portions of the text. The underlying text is always present and in the most advanced modus operandi, it is preferred that the user's cursor moves in the text pane will be mirrored in the triple track. This preferred embodiment is a bidirectional flow of content from the text to the triple track, and from the contacts in the triple track to the content in the text.

BRIEF DESCRIPTION OF FIGURES

The invention will be described in detail, with reference to the accompanying set of figures:

FIG. 1 is a general overview of some of the modules preferably incorporated in a preferred embodiment of the invention.

FIG. 2 is a schematic representation of an apparatus for acquisition in accordance with an embodiment of the present invention.

FIG. 3 shows a schematic representation of an apparatus for segmentation in accordance with the present invention.

FIG. 4 is a schematic representation of an apparatus for disambiguation according to the invention.

FIG. 5 shows an interface design of the APO triplets.

FIG. 6 shows a representation of APOS and SVOS and how these concepts are organized in triplets at different abstraction levels.

FIG. 7 gives a schematic representation of the construction of target word selection lists.

FIG. 8 shows the process of establishing proposed domain codes.

FIG. 9 shows a schematic representation of an apparatus for zonation.

FIG. 10 gives a schematic representation of elements comprised in an apparatus for filtering in accordance with the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The Apparatuses According to the Present Invention

The present invention embodies a set of interconnected apparatuses (or modules) that operate on an integrated set of database partitions populated with data transmitted by the devices associated with each apparatus. Search engines known in the prior art are considered as part of the present invention's environment. The set of devices performs a wide spectrum of text processing tasks in order to support the user in her interpretative tasks involving the need for text exploration. The constructed attention structures and representative samples of the text content derived from the underlying texts are presented in an interface preferably denoted as a text sounding board. The text sounding board incorporates a multitude of filtering options and the users' actions are at all times mirrored in the interconnected text pane. The present invention embodies a method and system (apparatus) of interrelated devices (modules) that prepare the texts and make them available for in-depth inspection and analysis.

The text sounding board will preferably consist of a set of specially designed windowpanes, arranged according to modern theory within HCI (Human Computer Interaction). The panes present interrelated ‘contacts’ to the underlying text and the user is given several options for filtering the content presented in the panes and thus filtering the underlying texts from which the contacts to the texts are extracted. The user can explore the text spans displayed in their full textual context in order to better decide between options for exploring and navigating through the text (denoted as options for text traversal). The following presentation describes the interconnected apparatuses in order of relatedness. That is, the order of presentation does not correspond to the order of processing tasks performed in iterative cycles.

Preferred Interface

The present invention prepares text for textual exploration and preferably presents the results in a specially designed interface preferably denoted as a text sounding board. The panes in the text sounding board reflect the textual features that are captured by the present invention's apparatuses. The content in the text sounding board informs the user about the features in the text laid out for inspection in the interconnected text pane. The system of panes and options is arranged in five different ‘modus operandi’. Each of them provides advanced tools supporting the user when she is engaged in interpretative tasks and has to figure out what the texts are about, and based on the information presented to her, determine what text zones to investigate thoroughly.

Traditionally in information retrieval systems, query expressions are considered as representing the users' information need. These expressions can take various forms, from simple free text expressions and keywords to NL (natural language) expressions that are parsed and transformed into a formal query expression. The user's problem will in any case be the same: the problem of anticipating what words actually occur in the text and how these words are related to each other, either within sentences, sections or other documental logical object types. The display of KWIC (Key Word In Context) according to one or several search terms in the query expression is often used in order to support the user in her retrieval task. However, it is difficult to grasp the text's flow of content from a KWIC. Firstly, the KWIC only displays the search terms by physical vicinity. Secondly, the KWIC normally is sorted based on centre words (search terms) and words immediately to the left and/or right.
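
For illustration, a conventional KWIC display of the kind referred to above may be sketched as follows. This is a minimal sketch of the prior-art technique only, assuming whitespace-tokenised input; the function name `kwic` and the `width` parameter are hypothetical. It makes the two limitations visible: the context is a fixed window of physically adjacent words, and the ordering follows the neighbouring words rather than the text's flow of content.

```python
def kwic(tokens, term, width=3):
    """Return (left, centre, right) rows for each occurrence of `term`,
    sorted by the words immediately to the right of the centre word."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = tokens[max(0, i - width):i]       # physical vicinity only
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    # Sorting on neighbouring words discards the order of appearance.
    rows.sort(key=lambda row: [w.lower() for w in row[2]])
    return rows
```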

The user (dealing with some sort of problem solving) needs an option for exploring representatives of the text content and having them displayed in their full textual context. The present invention is founded on the principle of text drivenness, in which the content presented in the text sounding board actually does appear in the underlying text prepared for the users' exploration. FIG. 5 outlines a prototypical embodiment of a partition of the preferred interface denoted as ‘triple track’.

‘Modus Operandi’ Denotes an Arrangement of Panes in the Interface

The present invention embodies sets of devices that operate on the database partitions in order to construct attention structures that are organised in various layers, preferably denoted as ‘modus operandi’. The design of the modus operandi is inspired by theory from ancient rhetoric, specifically Cicero's ‘De Oratore I.xxxi’. Each modus operandi preferably supports the activities known from ancient rhetoric as Inventio, Dispositio, Elocutio, Memoria and Actio. The design model based on ancient rhetoric is further elaborated in Aarskog (1999), which is incorporated herein by reference.

The various modus operandi differ in complexity of underlying necessary processing of the database partitions and intermediate files, thus supporting an option for regulating costs involved, in which costs can be balanced against the user communities' need for performance/benefit. Data are captured by procedures that generate results displayed in the text sounding board in which contacts to the underlying texts are made available to the user in order for her to explore the underlying texts and navigate through the texts during interpretative actions. Specifically, the apparatus for modus operandi generates the content of the text sounding board, the interconnections between the various panes in the sounding board, and the links between the contacts and the underlying text displayed in a separate text pane. In particular the device denoted ‘triple track’ mirrors ordered and interlinked sets of contacts captured by the present invention's apparatuses. By operating the triple track, the user can select or combine interlinked contacts by applying flexible filtering devices, and the user's actions and the results produced are displayed in directly linked contact with the text displayed in the text pane. The content of the text sounding board manifests the principle of text drivenness.

Layered Set of Database Partitions

An accommodated DBMS known in the prior art embodies interconnected Database Partitions (for reasons of efficiency). Each Database Partition (DBP) embodies multiple files/tables organised in a multi-levelled file system (MFAS) containing information transmitted from the present invention's apparatuses and devices, and which is input to the apparatuses for further processing (see FIG. 1). The collection of database partitions is documented in and managed via a higher order DBMS (virtual DBMS layer) denoted as Information Resource Management System, and preferably organised as specified in the ISO standard for IRDS known in the prior art.

The naming convention is as follows: DBP Information Word denotes the DataBase Partition that contains consolidated information about each word in the pre-processed text prepared for text exploration and navigation. The naming convention applies to all devices in this presentation, with the most general term to the left and more specific terms towards the right. The naming convention applies to all components in the present invention's specification model.

The Present Invention's Environment

User communities preferably influence the design and construction of particular applications conforming to the present invention. Search engines known in the prior art are considered as a component in the present invention's environment, and the relations between search engines and the apparatuses for acquisition embodied in the present invention are briefly described.

User Community

Information-intensive organisations understand the importance of high quality document access. Modern organisations are characterised as organisations of knowledge specialists. Organisational information takes the form of documents and documents are the interpretative medium that gives other information meaning within an organisational context. Low quality access to documents ‘equals’ low quality access to the organisation's acquired information and knowledge mediated via documents. Recent surveys indicate that executives spend 40% of their time dealing with documents and that for information-intensive organisations, as much as 90% of an organisation's information is contained in documents.

Documents in an Organisational Setting

The overall aim of the present invention is described in general terms in the section ‘Field of the Invention’. The concept ‘document collection’ is used to describe a set of documents organised according to a specific set of criteria, preferably specified by the user community. Preferably, the present invention will focus on ‘closed’ document collections in that the documents are related to a specific domain or field of interest, i.e. share some features related to situational context. In any organisational setting, document collections may be considered as ‘small-scaled’ as compared to the notion of ‘large’ in www environments. Even small document collections or subsets may be conceived as too large for users confronted with them in an organisational setting. The collection and its content exceed the user's futility point.

The present invention is however not restricted to such closed document collections, and can be used as a complement to Information Retrieval (IR) systems and search engines known in the prior art. Given that it becomes common to deliver well-formed documents appropriately annotated in XML, or a future meta-language, and common to pre-process texts with grammar taggers, the present invention can be applied as a kind of ‘on-top-of-technology’ as explained in detail below.

Document Landscape

The present invention incorporates a documental classification scheme specifically designed in order to organise and structure the display of text zones in conformity with document classes reflecting the documents' situational context (see also the definitions of the text's inner and outer context). The classification scheme is founded on theory related to how documents may be conceived as positioned according to actor relations (superior, subordinate, equality) and norms. This model is augmented by including a more detailed dimension reflecting norms of regulations and norms of competence, and includes a fine-grained diversification founded on the principle of workflow and actors seen in a sender-receiver perspective, as well as the intended audience. The document model is further elaborated in Aarskog (1999). The Dublin Core Element Set attached to each document is correspondingly augmented with elements used for assigning document class descriptors. In a typical foreseen situation, the user may select a set of texts as current for exploration that originates from different classes. Characteristically, a task requires the user to explore documents as ‘virtually co-existing’ and ‘examine them side by side’. This situation calls for sets of work-related documents to be opened/activated simultaneously. For example, the set of documents comprises one or several laws/directives, one or several reports referencing the laws, inquiry documents, public debates, and so on. In the present invention it is preferred to provide options for visualising the texts' or portions of the texts' origin in a 2-dimensional space (or 3-dimensional) from which the user promptly can comprehend the ‘information landscape’ surrounding the texts being investigated. The device for visualisation unfolds the hypertextual links between documents and situates the extracted texts in a manner that reflects the texts' outer context, as well as aspects related to the situational context.

The device for visualisation operates on the values assigned to each document's Dublin Core Element Set, and in particular the set of elements added in the present invention, and according to the document model described shortly above. In accordance with a system of identifier inheritance and propagation, each text or portion of text being selected by the user can automatically be displayed in a plane in which the coordinates are specified with reference to the documental classification scheme (document model). Icons representing texts or portions of texts in the plane of situational context may be disclosed by programmed techniques known in the prior art.

The interconnections between the texts, or text zones being derived documental logical object types, and their documental origin, and thereby the documents' DCES values, provide immediate information about ‘location in documental space’. In the preferred embodiment of the invention, each icon has attached buttons for activating either the DCES values and/or contacts that are current via the triple track in the text sounding board. Likewise, the user can move from the content of the text sounding board and have the current text zones visualised in their plane of outer context, and aspects of the situational context. When the knowledgeable user is given a direct overview of the text zones' locality, she may focus her attention on what text zones to opt for as possibly more useful/important than others. The sections of the plane will make her aware of the text zones' documental origin. The present invention emphasises that information about the texts' context must be available ‘at all places’ since it is considered as critical for a user confronted with piles of texts unfolded ‘side by side’. The principle is based on the concept of ‘topos’ (locus) in the ancient theory of rhetoric. The devices in the present invention preferably relieve the users from navigational dislocation known as ‘lost in hyperspace’.

One particular part of the classification scheme prescribes codes (organised in facets) that signify/express the text zones' superordinate argumentative function. The classification scheme adopted is elaborated in detail in Aarskog (1999). These codes are displayed in a designated pane in the text sounding board. The user can activate these codes and thereby establish contact with a kind of function-advancing information lead ‘on top’ of the underlying text. Again, this follows the principle of text drivenness and also reflects how the authors' focus of attention moves along lines described in the theory of text linguistics. For example, if the user activates a code ‘Problem Indicators’, the text sounding board will display the phrases classified as ‘problem indicators’ in the text, and the phrases from the text classified as such are accordingly highlighted in the text pane. Additionally, the text sounding board can display the nouns and/or verbs (either at the sentence level or at zone level) neighbouring the ‘problem indicators’. Similarly, the ‘triple track’ (i.e., a special device in the text sounding board that contains an ordered set of contacts structured around the notion of Subject followed-by Verb followed-by Object) will provide more detailed information about the inner context surrounding ‘problem indicators’.

The shift of contacts displayed in the text sounding board, depending on the users' choice of discourse element indicator, is founded on pragmatic reasoning. For example, a list of nouns (with options for display in various types of order and various levels of detail) captured from zones will give the user a first impression of what the text is about. The user can attain a more detailed impression by splitting the nouns into two broad categories: nouns in the role of subject and nouns in the role of object, preferably displayed in the order of appearance in the text. The number of contacts displayed in the panes embodied in the text sounding board can at all times be regulated in accordance with frequency information and information about density (the distance between word occurrences embodied in intersecting chains). If the user becomes aware of how these contacts are related to zones with respect to discourse element indicators, this will affect her reflections about the signals from the text sounding board. For instance, a set of nouns with the syntactical function of subjects and in order of appearance as {gas plant, power cable, environment, government, etc.} may evoke different thoughts if attached to ‘problem indicators’ than if attached to ‘solution indicators’.
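
As a minimal sketch of the noun display described above, the following assumes that each noun occurrence has already been annotated with its syntactic role by a grammar tagger and is supplied as a (word, role) pair in text order. The function name `noun_contacts` and the `min_freq` threshold are hypothetical; regulation by density is not shown.

```python
from collections import Counter


def noun_contacts(tagged, role, min_freq=1):
    """List the distinct nouns that occur in `role`, in order of first
    appearance, keeping only those seen at least `min_freq` times."""
    counts = Counter(word for word, r in tagged if r == role)
    contacts = []
    for word, r in tagged:
        if r == role and counts[word] >= min_freq and word not in contacts:
            contacts.append(word)
    return contacts
```

Raising `min_freq` reduces the number of contacts shown in a pane, which corresponds to regulating the display by frequency information.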

User Profile

The database partition User Profile preferably contains registered user activities on the text sounding board. The User Profile is derived from user requests that the user has marked as successful and/or user access log files (with the user's approval). A particular user may have available several User Profiles reflecting various types of tasks.

User Request

Information about the user request (series of moves or actions taken in the text sounding board) is preferably generated when the user explores a text sounding board from which she can select displayed contacts, either individual contacts or combinations of contacts. The user may also transmit traditional free-text search expressions and the text sounding board will present information about the terms in the free-text expression provided that the terms match the text content. When exposed to information about possible matches, the user can start the exploration of how these contacts (referring to word types in the text displayed in the text pane) can be utilised for further filtering. The user can also mark sentences in the current text explored and transmit these sentences to a device that generates a set of search operands by manipulating entries in the MAFS that contain information in accordance with the user-selected set of sentences.

User Requests are Divided Into Two Main Forms:

User Request Concept Expression denotes all types of search expressions, for instance in the form of traditional queries, terms selected from displayed lists in the sounding board's panes, or a combination. User Request NL Passage denotes requests where the user inputs a text span (written work in Natural Language) that the user finds noteworthy. See ‘User Added Text’.

User Request Annotated

User Requests in Natural Language (NL) form are annotated with grammatical tags according to the rules underlying grammar taggers known in the prior art. The annotated User Requests are transformed and stored in a representation depending on the identified grammatical patterns in the User Request. They can be activated as input (Search Operand) in filtering options, for instance when generating zone traversal paths (or sentence traversal paths if the text has no or few identified zones). The User Request Annotated may preferably be transformed in order to regulate the order of zone traversal. See ‘Zone Traversal Path Adjusted’.

User Request Concept Expression

A User Concept relates to a theme preferred by the user. A concept refers to a word or combination of words. If the concept appears in the text sounding board the user is informed that the concept actually occurs in the current underlying texts prepared for text exploration. The user may restrict the exploration so that the concept (or word(s) referred to) preferably should exist within the same text zone or in adjacent text zones, i.e. an operation connected to the filter ‘Zones Proximity’.
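
A restriction of this kind may, purely by way of example, be sketched as follows. The sketch assumes that zones are numbered consecutively and that an index from each word to the set of zone numbers in which it occurs is available; the names `zones_proximity`, `zone_index` and `max_gap` are hypothetical and do not describe the actual ‘Zones Proximity’ filter.

```python
def zones_proximity(zone_index, words, max_gap=1):
    """Return (first_zone, last_zone) windows in which every word occurs.

    max_gap=0 requires all words in the same zone; max_gap=1 also
    accepts words spread over two adjacent zones.
    """
    candidate_zones = sorted(
        set().union(*(zone_index.get(w, set()) for w in words))
    )
    hits = []
    for z in candidate_zones:
        window = range(z, z + max_gap + 1)
        if all(any(v in zone_index.get(w, set()) for v in window)
               for w in words):
            hits.append((z, z + max_gap))
    return hits
```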

User Request Concept Section

The Concept Section in the DBP User Profile stores concepts known to the user and may be activated by the device that generates zone bonds. The Concept Section may be activated when another particular device transmits and presents the content in the text sounding board, either by displaying only contacts (with links to underlying text) that match user concepts in the Concept Section or by highlighting these contacts in the sounding board. The Concept Section or any part of it can also simply be activated so that all matching word occurrences in the text are highlighted. This device is a kind of ‘awareness option’, and the user can preferably activate this option for all new documents that enter the system as workflow related to previous documents.

User Request NL Section

The User Request NL Section in the DBP User Profile contains information derived from any kind of Natural Language Requests. The NL Section reflects the user's previous activities. A particular device generates user profile spin-off (activation of filtering options in which ‘open operands’ capture word types in the user request that match with respect to grammatical class, etc.). The spin-off is transmitted to the device that generates zone bonds and zone traversal paths. A Zone Traversal Path Adjusted is a modification of the default path based on pre-calculated weights that are adjusted according to information captured from the user request. When activating one of the navigational operators, preferably displayed by comprehensible icons in the text sounding board, the user can navigate or traverse zones in the text matching parts of the spin-off.

The method according to the present invention processes the set of sentences in the User Request NL Passage in the same manner as any text prepared for text exploration. That is, the User Request NL Passage is entered as input to a grammar tagger known in the prior art, which outputs a User Request Annotated.

When a user enters a Natural Language Request, it is assumed by default that the words with particular syntactical functions are more noteworthy for the user. The device that generates user profile spin-off will assign a higher weight to these words and locate zones in which the same word occurrences appear in the same syntactical position. It is important to note that this particular device does not intend to operate as a ‘fact-finding system’ known in the prior art. The description given in the section ‘The principle of text driven attention structures’ explains why this type of goal is not an issue in the present invention. The device makes the user aware of zones containing the specific words and restricted according to some subset of grammatical information. However, this does not imply that the present invention suggests that the located zones reflect some kind of sameness in the ‘meaning-relationships’ between the words in the user request and the text exposed for exploration. The present invention generates attention structures and if the user decides to explore/read the content of the zones it is the user with her insight that determines whether there is a ‘meaning-relationship’ during her virtually dialogic interaction with the text.
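
The weighting described above may be illustrated by the following sketch. It assumes that both the annotated user request and each zone are available as (word, syntactic function) pairs; the weight values in `FUNCTION_WEIGHTS` and the name `score_zones` are hypothetical and are not taken from the specification. In keeping with the paragraph above, the resulting order only directs the user's attention and makes no claim about ‘meaning-relationships’.

```python
# Hypothetical weights: words in these functions count more than others.
FUNCTION_WEIGHTS = {"subject": 2.0, "object": 2.0, "verb": 1.5}


def score_zones(request, zones):
    """Rank zones by the request words that recur in the same function.

    request: list of (word, function) pairs from the annotated request.
    zones:   dict mapping zone id to a list of (word, function) pairs.
    Returns (zone_id, score) pairs, highest score first.
    """
    scores = {}
    for zone_id, occurrences in zones.items():
        present = set(occurrences)
        score = sum(FUNCTION_WEIGHTS.get(fn, 1.0)
                    for word, fn in request if (word, fn) in present)
        if score > 0:
            scores[zone_id] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```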

User Added Text

When a user has marked out text portions considered as important for the task, she may wish to insert a commentary (note, memo) and a device stores and manages the users' notes in a separate file. This service of note-management is known in the prior art. In addition to technology known in the prior art, the present invention preferably will give the user an option in which she can mark the zone and export the zone into the current memo. Since each zone inherits properties from the document the text originates from, and the memo has a registered insertion address (i.e. the zone, sentence or word identifier), the activation of the memo at a later point in time will invoke the referenced text in the text pane. The notes are treated as ‘user-added text’ and seen as interwoven with the current text examined by the user. The user-added text inherits a subset of the properties assigned to the current text (sentence identifiers at the insertion point, links to document information, etc) and a set of properties related to the act of adding text (time data, version data, user identifier, etc).
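
The inheritance of properties by user-added text may be sketched, under stated assumptions, as a pair of record types. The field names below are hypothetical; they merely mirror the properties enumerated above (insertion address, sentence identifiers, document link, time data, version data, user identifier).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Zone:
    zone_id: str
    document_id: str       # link to the document the text originates from
    sentence_ids: tuple
    text: str


@dataclass
class Memo:
    user_id: str
    note: str
    insertion_address: str  # here: the zone identifier
    document_id: str        # inherited from the zone
    sentence_ids: tuple     # inherited from the zone
    version: int = 1
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def add_memo(zone, user_id, note):
    """Create user-added text that inherits properties from `zone`."""
    return Memo(user_id=user_id, note=note,
                insertion_address=zone.zone_id,
                document_id=zone.document_id,
                sentence_ids=zone.sentence_ids)


def recall_text(memo, zones):
    """Return the referenced text when the memo is activated later."""
    return zones[memo.insertion_address].text
```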

The present invention includes a device that transmits the user-added text back into the line of processing. The users' notes (her own and/or others') are then made available for the user to explore in a similar way as the source text and the content of the user-added text is exposed in the text sounding board. This service fits well with the notion of information-intensive organisations of knowledge specialists. The present invention provides a wide range of access to source text and texts added in line with the users' process of interpretation and composition. At present, screen size complicates the display and visualisation of several source texts and preferably user-added text constituting the ‘place of investigation’, in which the texts can be opened and compared side by side. This is however a simple technical restriction, and it is expected that two-screen or wide-screen working places will become a customary mode of operation for users spending most of their working time dealing directly with documents (searching, locating, reading, interpreting, and presumably composing new text, etc).

Search Engine

IR-systems and search engines known in the prior art are considered as preferred technology in the present invention's environment. The aim of IR (Information Retrieval) technology is to detect (identify, differentiate, locate, and present) a possibly useful subset of a huge document collection. IR-systems typically incorporate advanced procedures for indexing and ranking a subset of a document collection. Ranking is commonly based on degree of similarity between vectors or document surrogates supposed to represent the content of whole documents or parts of documents.

The present invention embodies methods and devices that perform a deeper processing of individual texts selected as potentially relevant by a user. The collection of apparatuses may be conceived as an ‘on-top-of-technology’. Users, in a previous information seeking stage, have submitted a request to an IR system (for instance via a search engine), have browsed through the set of detected and ranked documents, and selected a subset from the detected set as ‘candidate documents’ judged to be of potential interest or possibly useful. The apparatus for acquisition can be applied on this user-selected subset of the document collection detected by an IR-system and mediated as a digital file (expected to preferably be encoded in XML in the near future). As outlined in FIG. 2, the candidate documents are pre-processed by particular devices and thereafter transmitted for further processing in the apparatuses for segmentation, disambiguation, zonation and filtering.

The Concept of ‘Relevance’

IR-systems known in the prior art commonly adopt a system's perspective to relevance. The notion of relevance is taken as a technical term referring to degree of similarity between a document vector (the representation of document content) and a query vector. A query vector is supposed to represent the user's information need.
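
To make the system's perspective concrete, the following sketch computes one common similarity measure, the cosine of the angle between two term-frequency vectors. The choice of cosine similarity and raw term frequencies is an assumption made for illustration; the specification only refers to a ‘degree of similarity’ and prior-art systems use a variety of weighting schemes.

```python
import math
from collections import Counter


def cosine(doc_vector, query_vector):
    """Cosine similarity between two sparse term-frequency vectors."""
    shared = doc_vector.keys() & query_vector.keys()
    dot = sum(doc_vector[t] * query_vector[t] for t in shared)
    norm_d = math.sqrt(sum(v * v for v in doc_vector.values()))
    norm_q = math.sqrt(sum(v * v for v in query_vector.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Vectors built from raw term counts (an assumed, simplistic weighting).
document = Counter("gas plant gas".split())
query = Counter(["gas"])
similarity = cosine(document, query)
```

A score of this kind depends only on shared terms, so a query using ‘attorney’ scores zero against a document that consistently uses ‘solicitor’.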

The present invention is based on a more human-oriented perspective. A ‘traditional query’ represents the user's attempt to verbalise their information need by trying to figure out discriminating terms with respect to all the millions of document vectors accessible via a search engine. In the section ‘Field of the Invention’ it is stated that the main problem for the user is to express her ‘information need’ as a request submitted to the search system. The search request is a search expression composed of a set of search terms and search operators. The search expressions are indirect in that the searches are not executed in the text itself, but in index structures that are supposed to represent the text content (text content surrogates). The search system compares the constellation of terms in the search expression with the system's index terms (document representations or document vectors). Further, it is commonly experienced that the user has no possibility to evaluate her search request with respect to terms available in the index structures. The index terms are ‘hidden’ in that the user can only perceive fragments (if the system offers options for looking into the index system at all).

The present invention does therefore not consider the concept of ‘relevance’ in the traditional IR sense. Instead, relevance is considered as a relative relation perceived differently by individual users, dependent on type of information need (interest oriented, fact oriented, etc), individual information seeking behaviour, task complexity, level of experience, use of sources, etc. In addition, the user's futility point will influence the user's judgement of relevance, which in the present invention is preferably paraphrased as usefulness or utility in order to mark a difference from the commonly accepted notion of ‘relevance’ as a vector similarity indicator (technical system's perspective).

Apparatus for Acquisition

The apparatus or module for document and text acquisition operates on documents detected from www sites or other sources of documents in electronic format. Preferably, the present invention will focus on ‘closed’ document collections in that the documents are related to a specific domain or field of interest. The present invention is however not limited to such closed document collections.

FIG. 2 gives a schematic presentation of an apparatus for acquisition in accordance with the present invention.

The collection of texts is initially retrieved in various formats, and has to be converted to at least one common format. The concept ‘document collection’ is used to describe a set of documents organised according to a specific set of criteria, preferably specified by the user community. The present invention will preferably operate on small document collections, which are nevertheless considered large by the user confronted with them in an organisational setting. The term ‘corpus’ usually refers to large heterogeneous document collections, although criteria for organising and managing corpora relate to smaller document collections as well. The discussion about the apparatus for acquisition of documents and texts therefore refers to a general discussion related to the construction of a corpus. A reference corpus with partitions that conform to the present invention's preferred document class model is applied in procedures that calculate genre specific values for keyness and keyness of keyness.

Device Support Document Analysis

Software for quantitative text processing known in the prior art will be used for exploring frequency and distribution data, with respect to both the words' inner context (words within one text) and outer context (words within several texts extracted from documents with shared features referring to the situational context). Frequency and distribution data include: frequency lists, collocations, concordances, consistency checks between word lists referring to various texts, plots displaying the scattering of occurrences, calculating ‘keyness’ (words that are unusually frequent in one document or document segment as compared to a larger corpus), sorting and filtering words and clusters, calculating statistics, exporting filtered word lists, etc. The output in the form of frequency and distribution data, i.e. collocations, concordances, plots, etc, is utilised to support the construction of domain-specific thesauri. A particular device will combine traditional collocations with grammatical information, specifically the words' grammatical class. The device uncovers patterns that give information about each content word, preferably constrained with reference to frequency data. The combined collocations show, for each content word in the texts extracted from a document, how often this particular content word co-occurs with another content word within a specified number of positions apart from each other. The concept ‘content word’ normally refers to the four main grammatical word classes (open-class parts of speech), i.e., nouns, adjectives, verbs, and adverbs.
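The combined collocations described above can be illustrated with a minimal sketch. It assumes that the text has already been tagged and is available as (word, tag) pairs; the tag names, the function name and the default span are hypothetical and are not taken from any particular tagger or from the devices of the present invention.

```python
from collections import Counter

# Hypothetical tag names for the four open word classes; the actual
# tag set depends on the grammar tagger that produced the input.
CONTENT_TAGS = {"NOUN", "ADJ", "VERB", "ADV"}


def combined_collocations(tagged_tokens, span=4):
    """Count co-occurrences of content words within `span` positions.

    tagged_tokens: list of (word, tag) pairs for one text.
    Returns a Counter keyed by ((word, tag), (collocate, tag)).
    """
    counts = Counter()
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag not in CONTENT_TAGS:
            continue
        # Look only to the right, so each pair inside the span is counted once.
        for other, other_tag in tagged_tokens[i + 1 : i + 1 + span]:
            if other_tag in CONTENT_TAGS:
                counts[((word.lower(), tag), (other.lower(), other_tag))] += 1
    return counts


sample = [
    ("The", "DET"), ("ministry", "NOUN"), ("quickly", "ADV"),
    ("approved", "VERB"), ("the", "DET"), ("new", "ADJ"),
    ("regulation", "NOUN"),
]
for pair, frequency in combined_collocations(sample, span=2).items():
    print(pair, frequency)
```

Keeping the grammatical class in the key makes it possible to filter the result by simple grammar patterns (for instance adjective followed by noun) and to constrain it further by frequency thresholds.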

Document Candidates

Document Candidates are the documents returned in a list (possibly ranked) from the application of one or several search engines. It is firmly stated that the present invention is not related to what is subsumed under the concept of Information Retrieval systems (IR). IR systems and search engines are preferably considered as components in the environment of the present invention (technical artefacts designed and constructed outside the present invention).

The present invention assumes documents detected by an IR system, or an already existing document collection within an organisational setting.

The candidates are analysed by one or several devices performing analysis at the document level. For instance the device for keyness calculation may operate on the document candidates as detected by a previous IR operation. The selected set of documents, originally detected by an external search engine, is transmitted to a convenient DBMS for persistent storage under the assumption that authority is given.

Document Format

There is a great variety of formats (multiformity), and each format requires special treatment. The source texts (doc, rtf, html, SGML, XML, txt, pdf) have to be converted to at least one common format (input to the grammar tagging, being the essential part of the disambiguation process described below).

Multiformity in source texts and source texts with low quality may result in a time-consuming format conversion process. For instance, highly formatted texts have words superimposed on background images, words are running in ‘hidden’ columns or tables, the texts are frequently interrupted by illustrations, misspellings, hyphenation, tabulator marks, single line breaks between paragraphs, missing punctuation marks, abbreviations, etc. There are standards for document structure but not for the authors' writing behaviour. This accounts for a mixed tool-set supporting the conversion processes. The domain specific collection must take two forms due to the software to be applied and new special purpose software that is constructed. A Plain Text Corpus (PTC) contains plain text files, and an Annotated Text Corpus (ATC) contains the same set of text with annotations. A device for text transformation produces the various formats.

Document Candidates Analysed

This denotes the set of documents returned from a device that performs automatic or semi-automatic quantitative text processing and analysis. This set of documents may be further reduced based on a prescribed set of more detailed selection criteria.

Document Collection

The document collection will include texts that already exist in electronic format. Data can be acquired by scanning printed material (requires good print and paper) and converted into electronic format. The process is error-prone and expensive and will be performed only when needed to meet the users' coverage criteria. Files in pdf (Portable Document Format) pose another type of problem. If spoken material is to be included, these files will not be converted through a transcription process. The files will be described in an attached Dublin Core record and, if convenient, linked to segments in the written material.

The document collection or part of a document collection (sub-collection) is stored as persistent in a web-accessible system managed by a DBMS known in the prior art. Each document preferably has a Dublin Core Element Set attached, which includes a unique identifier.

The pre-processing steps include format conversion and the partitioning of documents into sentences (and some other types of syntactical/lexical units) and indexing of the whole corpus. A complete full-text index of the whole corpus makes it possible to perform statistical analysis tasks. The software WordSmith or Document Explorer can preferably be used for these types of tasks.

When a document collection is constructed using electronic documents, the documents need to be saved in their original format. This is necessary for several reasons. The original format often contains useful information, which must be extracted into the metadata descriptions (e.g. the headers of html documents may contain information about author, keywords, the production date, language versions, format versions, etc.). This type of information will be extracted and assigned to fields in the Dublin Core element set (DC).
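As a minimal sketch of this extraction step, the example below reads `meta` tags from an html header and assigns their values to Dublin Core fields. The class name and the mapping table are assumptions made for illustration; which header fields are present, and how they should map to DC elements, depends on the documents at hand.

```python
from html.parser import HTMLParser

# Assumed mapping from common html meta names to Dublin Core elements.
META_TO_DC = {
    "author": "creator",
    "keywords": "subject",
    "date": "date",
    "language": "language",
}


class HeaderMetadata(HTMLParser):
    """Collects selected <meta name=... content=...> values as DC fields."""

    def __init__(self):
        super().__init__()
        self.dc = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in META_TO_DC and content:
            self.dc[META_TO_DC[name]] = content


parser = HeaderMetadata()
parser.feed(
    '<html><head><meta name="Author" content="A. Writer">'
    '<meta name="keywords" content="fish, quota"></head></html>'
)
print(parser.dc)
```

Fields that cannot be recovered from the header are simply left out of the record and can be supplied later by manual metadata assignment.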

Version data is necessary for two main reasons:

-   Used in order to measure progress and statistics (how many files in the various formats, etc.)
-   Used in order to know which tools to use in subsequent processing (transformation to plain text, part-of-speech tagging, etc.).

Device Keyness Calculation

Software known in the prior art computes each document's keyness (words with unusually high frequency as compared with a corpus). The words with highest keyness, above a threshold value, are filtered to include only nouns and posted in the element Keywords in the Dublin Core Set. Keyness values are used for documentation, that is, as a part of the metadata managed in the Information Resource Management System (IRMS). IRMS is realised in DBMS software known in the prior art. The corpus used in the keyness calculation is constructed specifically for the purpose, and along dimensions following the document model. The corpus construction is seen as an activity related to the apparatus for acquisition of documents assembled to form the domain specific document collection on which the present invention is to operate, including the extraction of texts from documents classified according to the document model.
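The keyness calculation and the subsequent noun filter can be sketched as follows. The text above does not name the statistic applied by the prior art software; the sketch assumes a log-likelihood score, which is one commonly used keyness measure. The function names, the noun set and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the word in the document and in the reference corpus.
    c, d: total number of tokens in the document and in the reference corpus.
    """
    expected_doc = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a:
        score += a * math.log(a / expected_doc)
    if b:
        score += b * math.log(b / expected_ref)
    return 2.0 * score


def keywords(doc_tokens, ref_tokens, nouns, threshold=6.63):
    """Return (word, score) pairs for nouns with keyness above the threshold.

    `nouns` is assumed to come from a grammar tagger; the default threshold
    is only an example value.
    """
    doc, ref = Counter(doc_tokens), Counter(ref_tokens)
    c, d = len(doc_tokens), len(ref_tokens)
    scored = []
    for word, a in doc.items():
        b = ref[word]
        if a / c <= b / d:
            continue  # not unusually frequent in the document
        score = log_likelihood(a, b, c, d)
        if score >= threshold and word in nouns:
            scored.append((word, score))
    return sorted(scored, key=lambda item: -item[1])
```

The resulting word list corresponds to the candidates that would be posted in the Keywords element; the reference tokens stand for the partition of the reference corpus that matches the document's class.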

Device Support Corpus Construction

Information-based organisations normally have their own ‘in-house’ document collections, and regularly incorporate information judged as useful from external sources. These external sources will preferably be detected and retrieved by search engines known in the prior art. The information retrieval system's architecture that these search engines operate on often causes a failure of discrimination by delivering a result list of potentially useful documents that exceeds the user's futility point. The device for keyness calculation requires a device that supports the construction of a reference corpus conforming to certain criteria for coverage (genre, actuality, etc.). The present invention applies various software tools known in the prior art, which support the construction process. Particular programs provide interconnections between various software tools in order to ease the transmission of output from one particular device to another. These programmed connections customise the software tools according to the needed processing tasks.

The acquisition of data involves the use of general-purpose software known in the prior art. Document Explorer and WordSmith are both systems used for quantitative processing of large text collections and can operate on text collections constituting millions of running words. Document Explorer accepts texts annotated with structural information (documental logical object types such as title, header, paragraphs, sentence, etc) and grammatical information, that is Part-Of-Speech tags (POS) and Constraint Grammar tags (CG tags). WordSmith, and preferably Document Explorer, can thus be used as general-purpose software in order to construct traditional concordance output, for instance a KWIC concordance. Document Explorer and WordSmith can also generate lists of collocates and produce a wide range of frequency and distribution data for various parameters. A particular device is constructed in order for the general-purpose software to handle grammatical information and intersect this information with distribution data. In particular, ‘combined collocations’ are transmitted to devices embodied in the present invention, which activate procedures for the identification of new words or word constellations conforming to simple grammar patterns.

DBP Information Reference Corpus

The accommodated device supporting corpus construction focuses on the quality of the texts collected. To ensure quality, a particular database partition (DBP) includes various types of information about the texts. Examples are document source, collection date, person responsible for collecting it, language, copyright status, dissemination license (permission is obtained, permission denied or restricted), format information, version information, and so on. These records are preferably stored and managed in a web-based database application, each record giving access to a URL and the document stored as persistent in a web-accessible system (DBP Information Reference Corpus).

The reference corpus can be dynamic and open-ended or the size may be known from the outset, or at least there is an estimated size of the corpus. When the size of the corpus is known in advance, this indicates a target to be reached and marks the end of a data collection phase. If the collection is to be open-ended, the positioning of documents must be based on a specified set of criteria that accommodate the defined document class model embodied in the present invention.

Users within a user community will have different views about the categories and subcategories of the texts that are assembled in a document collection. However, such views are often stated in very general terms, mentioning document types, organisations, events, particular years, etc. Therefore, the first step is to get the user community to formulate their needs for data in explicit terms. Thereafter, it must be decided what types of data are to be included in the document collection and in what proportions. The document class model allows for adjustments conforming to user requirements. The reference corpus applied in the calculation of ‘keyness’ and ‘keyness of keyness’ values must be adjusted to reflect the details of the revised document class model.

DBP Information Frequency

Frequency information underlies the processing at all levels (word, sentence, zone, text, text collection).

The present invention applies the widely known inverse document frequency adapted from information retrieval literature. This technique calculates the relative frequency of words in an item (sentence, zone, text) compared with the word's relative frequency in a set of other items. The devices in the present invention also operate intratextually; it is therefore of interest to calculate for instance a word's relative frequency in a zone as compared to the word's relative frequency in the whole text or a collection of domain-related text. The background collection of items must be adjusted in terms of genres in each case based on the filtering purpose or need for discrimination between word types. The technique supports the process of identifying so-called focused words and may be restricted along a variety of dimensions as described in a set of underlying criteria for text zonation.
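The intratextual comparison can be illustrated by a simple ratio of relative frequencies. This is a sketch under stated assumptions, not the exact weighting used by the devices: the function name is hypothetical, and the add-one smoothing is an assumption introduced only to avoid division by zero for words absent from the background.

```python
from collections import Counter


def relative_frequency_ratio(item_tokens, background_tokens):
    """Ratio between a word's relative frequency in an item (sentence,
    zone, text) and its relative frequency in a background set of items.

    Values well above 1 mark candidates for focused words.
    """
    item = Counter(item_tokens)
    background = Counter(background_tokens)
    n_item, n_background = len(item_tokens), len(background_tokens)
    ratios = {}
    for word, frequency in item.items():
        rel_item = frequency / n_item
        # Add-one smoothing (an assumption made for this sketch).
        rel_background = (background[word] + 1) / (n_background + 1)
        ratios[word] = rel_item / rel_background
    return ratios


zone = ["quota", "quota", "the", "fish"]
whole_text = zone + ["the"] * 6 + ["fish", "boat", "the", "sea"]
ratios = relative_frequency_ratio(zone, whole_text)
print(sorted(ratios.items(), key=lambda item: -item[1]))
```

Swapping the background argument (whole text, domain collection, genre-specific partition) corresponds to adjusting the background collection to the filtering purpose.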

Device Support Document Classification

The present invention includes a method and system for organising documents with reference to a model describing document classes. The document model and document instances assigned to the system of classes provide for a multidimensional representation of the document set. This multidimensionality in links between documents will preferably be transmitted to a device for visualisation of texts extracted from documents, and thereby directly give the user insight into aspects related to the texts' outer context, and aspects related to the situational context.

The document model primarily reflects issues from juridical theory and traditional classification theory, and additionally incorporates the dimension central and peripheral with respect to for instance organisational or procedural matters. The document model thus supports the definition of hyper textual links between text zones or text segments extracted from different documents (authority and norm space model). Because the model is general at the highest abstraction level, it adapts easily to document collections of interest to various user communities related to particular domains (social, cultural, law, business, etc.).

Document Class Model

The document class model primarily reflects issues from juridical theory and traditional classification theory, and additionally incorporates the dimension central and peripheral with respect to for instance organisational or procedural matters. The document model thus supports the definition of hyper textual links between text zones or text segments extracted from different documents (authority and norm space model). Because the model is general at the highest abstraction level, it adapts easily to document collections of interest to various user communities related to particular domains (social, cultural, law, business, etc.).

The classification criteria are based on two dimensions describing the relations between actors participating in the interaction/communication. One of the dimensions tells whether the relation between actors is superior or subordinate or whether the actors are to be considered as equal. The other dimension tells something about the norms influencing the relations between actors. The norms are demarcated towards norms of competencies. Norms of competencies are divided into two subclasses denoted as legal authority and other forms of authority. The latter includes authority experienced through social norms, authority delegated through a decision and with a limited duration, norms in the form of regulations, standards or other types of qualification norms. The cross points between these dimensions yield four broad document classes which may be further divided into subclasses, for instance according to more traditional criteria describing document types (law, regulation, report, etc). The four broad document classes (each class with subclasses) will support the need for restricting the search span. Information about document classes will therefore preferably be included in one of the panes in the text sounding board. Most retrieval systems offer the option for restricting search spans by selecting database partitions. The classification criteria underlying the document classes according to the present invention are however different with respect to the features of the situational context that are taken into account. The underlying criteria are derived from juridical theory and traditional classification theory. The document classes will also support the definition of hyper textual links between text zones extracted from different documents. For instance, text zones extracted from Debate documents (e.g. newspaper) may in some way be related to utterances in the Negotiation documents (e.g. discussions in ministries) and further on in Normative Regulations (e.g. laws, directives, regulations).

The text zones are preferably to be connected in a hypertext system, that is, predefined links between selected (extracted) text zones (the links are considered as conceptual pathways through the text base). Most preferably a knowledgeable user will be able to construct their own hyper textual structures superimposed on the text zones they found noteworthy, and by this operation the hypertext will yield a kind of ‘user view’.

Document Class Normative Regulation

This class covers all types of normative regulations, that is all types of formal, approved norms such as laws, regulations, directions, rules, etc. regulating the enterprise and activities within an institution. The DCES for this class of documents will include information about legal authority and actors mentioned in the normative document, preferably displayed with superior/subordinate relations.

Document Class Competence

This class covers all types of normative regulations, that is all types of formal, approved norms such as laws, regulations, directions, rules, etc. regulating the enterprise and activities within an institution. The DCES for this class of documents will include information about legal authority and actors mentioned in the normative document, preferably displayed with superior/subordinate relations.

Document Class Debate

The class refers to all types of viewpoints expressed in various types of channels for debates such as speeches, comments/chronicles in all types of media, including news reports, interviews, etc. All types of authority and all types of social norms may be involved, and there are equality relations between actors due to the channel used for mediating opinions.

Document Class Negotiation

The class covers all documents related to affairs dealt with in an administrative agency or other institution, etc. Often a legal authority is involved while the sender and receiver relations reflect actors with equality relations.

Document Relation

The present invention also provides a method for providing a documental link structure. The link structure is based on an authority and juridical norm space model in which documents are organised with respect to factors defined by the documents' situational context and adjusted according to the requirements of the user community. The multidimensional representation of the document set is founded on a model describing document classes, each class with subclasses according to the documents' status (production date, producer's authority, etc.). At the highest abstraction level, there are four broad classes described in the section ‘Document Class Model’.

The concept behind the link structures established is that the user should be able to identify easily the documents that are most likely to be relevant with respect to current information needs. The link structures are presented to the user as a graphic image with each class, subclass and document represented by an icon. The user can ‘open’ the icons for more information about the documents and this information is also organised in several abstraction layers. At the upper level the user can explore document class information, at the next level the user can explore information encoded in the Dublin Core element set, and on the most detailed level the user can explore the documents through the basic triplet structure. The user is given control over the display of layers and can easily navigate through the document collection. By incorporating the dimension central and peripheral with respect to for instance organisational or procedural matters, the link structures are realised as multidimensional. For instance, a group of peripheral documents within one class may be linked to a central document within the same class, and central documents can be linked to each other within the same class or across several classes (hierarchically or in networks).

The document class scheme will be used when deciding what type of texts to include in each class/subclass and in what proportion (topicality, coverage, etc.). The decision on a strategy for the size of the document collection and composition can vary across these broad document classes. For instance, it may be convenient to decide on a rather closed strategy for the Normative Regulations, and a more open-ended or mixed strategy for the other three classes. The users must provide the selection criteria.

The present invention comprises two broad classes of association types, i.e. document level association types and zone level association types, described in more detail in Table 1 below. Document Level Association Types are links between units that are whole documents. The content of the text sounding board at this level will be information extracted from the Dublin Core Element Set.

These association types reflect aspects of the document's situational context as prescribed in the document class model.

Zone Level Association Types preferably include icons referring to identified indicators for discourse elements. By activating the icon, the text sounding board will preferably display the words and phrases classified as discourse element indicators. Similarly, a preferred icon represents the Dublin Core Element Set, which is attached to all textual units extracted from a document.
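The numbering of association types in Table 1 can be expressed as a small lookup. The sketch below is only an illustration of how the thirteen types follow from three properties of the linked units (document or zone, central or peripheral, same or different event or time span); the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    document_id: str
    central: bool                  # True: central document, False: peripheral
    event: str                     # label for the event or time span
    zone_id: Optional[str] = None  # None when the unit is a whole document


def association_type(a: Unit, b: Unit) -> int:
    """Return the Table 1 association type (1-13) for a link between a and b."""
    if (a.zone_id is None) != (b.zone_id is None):
        raise ValueError("Table 1 lists no type linking a document to a zone")
    zone_level = a.zone_id is not None
    if zone_level and a.document_id == b.document_id:
        if a.central:
            return 7  # zones within one central document
        raise ValueError("Table 1 lists no type for zones within a peripheral document")
    if a.central and b.central:
        base = 1
    elif a.central or b.central:
        base = 3
    else:
        base = 5
    number = base + (0 if a.event == b.event else 1)
    # Zone level types 8-13 mirror document level types 1-6.
    return number + 7 if zone_level else number


print(association_type(Unit("d1", True, "e1"), Unit("d2", False, "e1")))
```

Combinations that Table 1 does not enumerate are rejected rather than guessed at.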

TABLE 1

Document Level Association Types
1   Between Central Documents, same Event or Time Span
2   Between Central Documents, different Events or Time Spans
3   Between Central and Peripheral Documents, same Event or Time Span
4   Between Central and Peripheral Documents, different Events or Time Spans
5   Between Peripheral Documents, same Event or Time Span
6   Between Peripheral Documents, different Events or Time Spans

Zone Level Association Types
7   Between Zones within Central Document
8   Between Zones across Central Documents, same Event or Time Span
9   Between Zones across Central Documents, different Events or Time Spans
10  Between Zones across Central and Peripheral Documents, same Event or Time Span
11  Between Zones across Central and Peripheral Documents, different Events or Time Spans
12  Between Zones across Peripheral Documents, same Event or Time Span
13  Between Zones across Peripheral Documents, different Events or Time Spans

DCES Dublin Core Element Set

The device for identifying and registering information about the documents' structure extracts bibliographic data, preferably from a Dublin Core Element Set which is a component in the file ‘Information Document Structure.’

Minimum: Dublin Core Element Set (DCES) includes 15 DC standard elements (top level) and user community elements subsumed in one DC element. The present invention incorporates new elements needed for the purpose of other devices in the present invention, and these new elements are subsumed under DC.

The extraction or production of a DCE Set requires a tool set for metadata assignment and management. The file containing information about document structures should ideally incorporate a metadata section (new segment encoded in document, XML DC) adapting the general scheme Dublin Core. Dublin Core is a structure known in the prior art, and is flexible without being too complex and so versatile that any document can be described with it. Authors can also easily provide metadata by themselves. Dublin Core is considered as a sort of ‘lowest common denominator’. The preferred device must collect and record metadata about documents, including results from text processing (calculations of frequency, keyness, etc), represent the connections between text files (stored in corpus) and generate reports on metadata assignments (corpus resource reports).

DC elements may provide contacts (codes) to some of the facets in the classification scheme: [cat1 fac1 Sender], [cat1 fac2 Receiver] and [cat2 fac2.5 Time Utterance (Logical Now)].

DCE Relation

The Dublin Core Element Set includes the Element Relation that gives information about relations or references to other documents. Besides, within the text, there may also be references to other documents. If the documents are denoted by ‘reserved names’ (as in laws), and a limited set of spelling variants, these document relations can be identified and represented automatically. In order to identify cue phrases (signals in the form of words), it is necessary to use a Target Word Selection Procedure operating against a thesaurus entry for ‘document synonyms or near-synonyms’ (document={report, appendix, paper, book, article}).
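A minimal sketch of cue phrase identification against the thesaurus entry for document synonyms is given below. It is not the Target Word Selection Procedure itself; it only assumes a plain regular-expression match on the listed synonyms with an optional plural ‘s’, and the function name is hypothetical.

```python
import re

# Thesaurus entry for 'document' as listed in the text above.
DOCUMENT_SYNONYMS = ("report", "appendix", "paper", "book", "article")

# Whole-word match with an optional plural 's'. Irregular plurals such as
# 'appendices' are not covered by this simplification.
CUE_PATTERN = re.compile(
    r"\b(" + "|".join(DOCUMENT_SYNONYMS) + r")s?\b", re.IGNORECASE
)


def find_document_cues(sentence):
    """Return (synonym, character offset) for each cue word in the sentence."""
    return [(m.group(1).lower(), m.start()) for m in CUE_PATTERN.finditer(sentence)]


print(find_document_cues("See the Report and both articles."))
```

The offsets allow the surrounding words to be inspected afterwards, for instance to look for a ‘reserved name’ next to the cue word.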

Information about Document Relations is used in order to generate hyper textual links (inter-documental links).

Information about co-references may also be used to position documents in a documental space (central and peripheral documents in a Semantic-Pragmatic Distance Distribution Model), which may be adjusted according to users' preference.

The document class model includes a description of how the dimensions of this model preferably will be utilised in order to display text zones, preferably selected by the user, in an authority norm space model.

Apparatus for Segmentation

The concept of text segmentation in the present invention denotes the set of procedures that recognise structural elements depending on the format of the ATF (Annotated Text File). Text segmentation concerns structural properties while text zoning concerns thematic properties.

FIG. 3 gives a schematic presentation of an apparatus for segmentation in accordance with the present invention. The apparatus embodies devices that deal with techniques for constructing files enriched with tags describing the documents' logical structure. The segmentation process also includes metadata assignment, and a preferred embodiment of the invention applies the Dublin Core Metadata Element Set.

The present invention does not provide an unsupervised separate procedure for additional XML-coding (for instance according to TEI or XML-Schema). Structural encoding of texts is considered as an affair outside the present invention. The present invention is based on a general expectation that most software for the production of documents will offer options for XML-transformations in the near future.

XML is a proper format in that XML ensures that each text contains self-describing information. The information encoded in the tag system can be extracted, manipulated and formatted to various requirements (user requirements or requirements of target software) and text can be queried and displayed by using both free XML software tools and specially constructed advanced XML-based tools. The advantages of XML are well known. In summary, XML offers a high degree of flexibility regarding the specification, tuning and optimisation of search selectivity and search functionality. XML-documents (TEI or XML-Schema) support data independence and the possibility to define user views similar to those in traditional DBMS. With external files added, it is possible to manage overlapping and discontinuous constituents via the intermediate layer in MAFS.

The device for text extraction operates on documents and extracts the documental logical object types (DLOT), of which text is one. Further it translates the text objects into a stream of contiguous units (also DLOT), namely sentences and words. If the document is encoded in XML-format, the procedure extracts the text with the tags preserved. Some of the filtering options assume at least XML-tags for very common DLOTs such as titles, headers, paragraphs, lists, references, and so on (at a minimum these filtering options require a well-formed XML-document).

The basic elements (sentences and words) are encoded with XML-tags as a data wrapper during transfer of the text between various devices that support a consistent format treatment. A ‘minimal XML-encoding’ ensures that the same set of transformation scripts can be applied to parts of the collected source texts in one run (without additional tuning).
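A minimal XML-encoding of sentences and words might look like the sketch below. The element names (`text`, `s`, `w`), the identifier scheme and the naive sentence splitter are assumptions made for illustration only; the present invention does not prescribe them.

```python
import re
import xml.etree.ElementTree as ET


def minimal_encoding(text):
    """Wrap sentences and words of a plain text in XML elements."""
    root = ET.Element("text")
    # Naive split after sentence-final punctuation; a real device would
    # handle abbreviations, headings and similar cases.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for s_no, sentence in enumerate(sentences, start=1):
        s_el = ET.SubElement(root, "s", id=f"s{s_no}")
        # Words and punctuation marks become separate <w> elements.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        for w_no, token in enumerate(tokens, start=1):
            w_el = ET.SubElement(s_el, "w", id=f"s{s_no}w{w_no}")
            w_el.text = token
    return root


root = minimal_encoding("Texts are split. Words are wrapped!")
print(ET.tostring(root, encoding="unicode"))
```

Because every unit carries its own identifier, later devices can attach grammatical tags or zone information to the same units without changing the wrapper.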

The device that identifies the documents' structure preferably includes procedures for unsupervised minimal encoding of DLOTs, and also offers options for manual intervention/validation (annotation-editing mode) in order to ensure the quality of the document collection (see corpus quality). This unit also provides facilities for the system designer interested in customising the tags and/or the attributes associated with each tag. The present invention stores and manages annotated files organised in layers (abstraction levels). The annotated files (with version control) support more advanced applications (presupposing additional analysis and encoding) and the support for not yet foreseen types of information needs.

Device Document Structure Identification

This device generates document structure information, and the process depends on the document format (plain text or a document structured with a mark-up language such as XML, HTML, or others).

The Document Structure Identification Device receives a document as input and determines the meta-language (e.g. XML), which sets the condition for the automatic identification of Document Logical Object Types (DLOT). These techniques are known in the prior art.

The Document Structure Identification Device extracts the document's Dublin Core Element Set (DCE Set), or equivalents, and stores this information in a separate register (together with the Document Identifier).

Document

A document is considered as a container of several object types (Document Logical Object Types, abbreviated DLOT). The present invention adopts a wide definition of documents as including all known types of data that can be mediated electronically.

Documents embody a hierarchy of parts consisting of fixed attributes, such as author, date and keywords, and one or several DLOTs. Text is a DLOT, namely a natural language text section.

XML is used to represent both the fixed attributes and the DLOTs. The Dublin Core Element Set (DCES) is the template used for representing fixed attributes extracted from the document header (possibly hidden in the HTML tags). A simple XML Schema is used in order to represent the text structure (a hierarchy of parts).

Both the DCES and XML Schema instances are surrogates, representing the document (container) and the text (DLOT within the container) respectively.

Identifier Document

Identifier supervision is important. The problem of document overlap is constantly reported as a cause of failure in text processing systems.

A document collection produced as a result of one user request may contain documents already delivered in response to an earlier request (from the same or a different User/Customer). Each document extracted from a collection for further pre-processing is therefore assigned two identifiers: an internal identifier (date+serial number or time stamp) and an external identifier (preferably extracted from the original document header). The external identifier, possibly together with information about site (www location), a value stored in DCES, makes it easy to identify document overlaps in a particular document collection.
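The two-identifier scheme can be illustrated with a minimal Python sketch. The class `DocumentIds`, the functions `register` and `find_overlaps`, and the exact layout of the internal identifier are hypothetical names and choices made for illustration only; the specification requires just a date plus serial number (or time stamp) and an external identifier taken from the document header.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count

@dataclass(frozen=True)
class DocumentIds:
    internal: str  # date + serial number (layout is an assumption)
    external: str  # identifier taken from the original document header
    site: str      # www location, a value stored in DCES

_serial = count(1)  # running serial number for internal identifiers

def register(external: str, site: str, today: date) -> DocumentIds:
    """Assign a fresh internal identifier to an incoming document."""
    internal = f"{today:%Y%m%d}-{next(_serial):06d}"
    return DocumentIds(internal, external, site)

def find_overlaps(collection):
    """Return pairs of internal identifiers that denote the same source document."""
    seen = {}
    overlaps = []
    for doc in collection:
        key = (doc.external, doc.site)
        if key in seen:
            overlaps.append((seen[key], doc.internal))
        else:
            seen[key] = doc.internal
    return overlaps
```

Two deliveries of the same report receive different internal identifiers but share the external identifier and site, which is what exposes the overlap.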

Identifier Document External

Each document is assigned an external identifier. This is, for instance, the identifier assigned to the document by the document producer, and it will preferably be automatically extracted and represented in the document's DCES. In this case the Identifier Document External is the same as the Identifier Dublin Core.

Identifier Document Internal

An internal identifier is assigned to each document and is inherited by all objects that relate to the document, that is, all objects extracted from the document or derived as a result of processing the document. Thus text, DCES, picture, table or whatever documental logical object type is assigned the Identifier Document Internal (via key propagation in the MAFS).

DLOT Document Logical Object Type

Text format (plain text or text marked up with XML or the like) determines the automatic identification of DLOT. Sentences are the most essential DLOT to the present invention. The remainder of this specification therefore focuses on text and sentences as the main textual object type (text constituent).

DLOT Text

The present invention operates on texts extracted from documents, preferably from a domain specific document collection. Text is one of the Document Logical Object Types, and in the MAFS the file containing information about the text units ‘wraps’ the files containing information about the sentences in each particular text. The file connections are established via traditional key propagation.

Identifier Text

If the text extracted from one document is one unit, the text identifier equals the document identifier prefixed with ‘txt’. If the text is divided into subtexts, the identifier is extended with a serial number. For instance, if a report is divided into one subtext for each chapter (the determination condition is the chapter header), the text identifier is <document ID+txt+serial number>.
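As a small illustration, the identifier construction can be sketched in Python. The sketch follows the <document ID+txt+serial number> ordering; the hyphen separator, the zero padding and the function names `text_identifier` and `split_into_subtexts` are assumptions made for the example.

```python
from typing import Dict, List, Optional

def text_identifier(document_id: str, serial: Optional[int] = None) -> str:
    """Build a text identifier from a document identifier.

    Without a serial number the whole text is one unit; with a serial
    number the identifier denotes one subtext (for example one chapter).
    """
    base = f"{document_id}-txt"
    return base if serial is None else f"{base}-{serial:03d}"

def split_into_subtexts(document_id: str, chapters: List[str]) -> Dict[str, str]:
    """Give each chapter its own text identifier, numbered from 1."""
    return {
        text_identifier(document_id, n): chapter
        for n, chapter in enumerate(chapters, start=1)
    }
```

Because every identifier begins with the document identifier, files produced later in the pipeline can be traced back to their source document by simple key propagation.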

All text files transformed or pre-processed by the Text Transformation Device inherit the Text Identifier.

DBP Information Document Structure

The device for the identification of document structure analyses input documents encoded with determination conditions of document elements and separates the document elements into two main descriptive parts:

-   1) A description of the document contained in the Document Dublin Core Element Set (DCE Set) (container information).
-   2) A description of the logical objects in the document (DLOT) from which the text, being a part of the author-focused information, is extracted (content information).

The object type text is transmitted to the CG Parser for analysis.

CG parsers normally recognise sentences (part of text) and words (part of sentence). In addition, it is preferred to have a minimal set of sentence descriptors such as Title, Header, etc.

DBP Information Sentence

Sentences are the main logical object types (DLOT) processed by the present invention. The Device Sentence Extraction fills a file in MAFS with information extracted from the set of annotated sentences. The file ‘Information Sentence’ is populated with output from several devices performing various text processing tasks.

The main table has one entry for each sentence identifier (Identifier Sentence).

In the database, different types of information about each sentence and the set of sentences (intratextual and intertextual) are consolidated over the Identifier Sentence (normal key propagation).

The database containing Information about Sentences is divided into several interlinked tables (the tables may be considered as stored views, either temporarily stored or stored as permanent tables used in further processing). The approach is thus traditional DB processing known in the prior art.

DBP Information Sentence Occurrence

Each occurrence of the DLOT Sentence has attached to it a series of Attribute Types describing the sentence occurrence. In the specification document these are denoted as SATOT, Attribute Types attached to the Object Type Sentence.

DBP Information Word

This DBP contains information about the words that are registered as important for the construction of attention structures, including the construction of content displayed in the text sounding board. It is populated by several devices.

The information about each word in a sentence includes at minimum {Word Identifier, Word lemma, Word Position Relative within the sentence (part of the Word Identifier), {Word Grammatical Information}, {Word associations to other files}}.

The Word Information is processed in a device for word frequency calculation and a device that produces combined collocations. The Word Information for each sentence is used when calculating the similarity between all pairs of sentences in a text. Word Information <is part of> Sentence Information, which in turn <is input to> the Device Zone Identification.

The similarity calculations can be tuned to identify spans of sentence repetition. This type of repetition occurs frequently in longer reports (a signal indicating that the author judges the sentences as important). This type of author-focused information can be used as one of the criteria when reducing the set of Lexical Chains (used as input when calculating Zone Density and Zone Weight).
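The specification does not name a particular similarity measure. As a sketch only, the pairwise calculation and the detection of repeated sentences can be illustrated with Jaccard similarity over lemma sets; the measure, the threshold value and the function names are assumptions chosen for the example.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two lemma collections (0.0 when both are empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def pairwise_similarity(sentences: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Similarity for all pairs of sentences, keyed by sentence identifier pairs."""
    return {
        (i, j): jaccard(sentences[i], sentences[j])
        for i, j in combinations(sorted(sentences), 2)
    }

def repeated_pairs(similarity: Dict[Tuple[str, str], float], threshold: float = 0.8):
    """Sentence pairs similar enough to count as repetition (threshold is tunable)."""
    return sorted(pair for pair, score in similarity.items() if score >= threshold)
```

Raising or lowering `threshold` is one simple way of tuning the calculation towards exact or near repetition.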

A word is a token. There are other types of tokens, such as abbreviations, dates, numbers, etc. These token types may preferably have attached an extra set of attribute types not necessary in the main word information file.

Frequency Word Level

Each word occurrence is assigned its own identifier (Sentence ID+word's relative position within sentence). The frequency information is used when calculating the distribution of the lexical chains and when calculating the density of each chain within each text zone.
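A minimal Python sketch of word occurrence identifiers and lemma frequencies follows. The identifier layout (`.w` plus position) and the function names are assumptions; the specification only states that the identifier combines the sentence identifier with the word's relative position.

```python
from collections import Counter
from typing import List, Tuple

def word_identifier(sentence_id: str, position: int) -> str:
    """Sentence ID + relative position of the word within the sentence."""
    return f"{sentence_id}.w{position}"

def index_words(sentence_id: str, lemmas: List[str]) -> List[Tuple[str, str]]:
    """Pair every word occurrence in a sentence with its own identifier."""
    return [
        (word_identifier(sentence_id, pos), lemma)
        for pos, lemma in enumerate(lemmas, start=1)
    ]

def lemma_frequency(occurrences: List[Tuple[str, str]]) -> Counter:
    """Count how often each lemma occurs across the indexed occurrences."""
    return Counter(lemma for _, lemma in occurrences)
```

The resulting counts are the kind of frequency information that a later chain density calculation could take as input.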

Device Text Extraction

The documents in a pre-selected, preferably domain specific, document collection are processed by a device for text extraction in which the textual elements are extracted and assigned a serial tag, DLOT Serial Tag. The device is interconnected to a separate device that recognises sentences and assigns an identifier to each sentence extracted from the document. A series of sentence identifiers will form a text span that may occur in between other documental logical object types (DLOT) such as tables, figures, audio, video, etc. The aggregates of sentences or text spans are consolidated into a single text file. The DLOT serial tags are input to a device that reconstructs the document (reincorporating other DLOTs in between the identified text spans).

The device is included in the present invention so that the invention does not depend on an explicit encoding of documental logical object types such as Text (Paragraph, Header (all levels), Sentence). In practice this means that if a document is not encoded in XML (or another structure description language such as SGML, HTML, etc.), the device for text extraction simply extracts sentences (recognised as textual elements) as parts of text, and equips each sentence with an identifier (Identifier Sentence). The sets of sentence identifiers are used as entry points to information about each sentence occurrence (Information Sentence).
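For a document without mark-up, the extraction step can be sketched as follows. The punctuation-based splitting rule is a deliberately naive stand-in for a real sentence recogniser, and the identifier layout and the function name `extract_sentences` are assumptions made for illustration.

```python
import re
from typing import List, Tuple

def extract_sentences(text_id: str, text: str) -> List[Tuple[str, str]]:
    """Split plain text into sentences and equip each with an identifier.

    Sentences are split after '.', '!' or '?' followed by whitespace.
    Abbreviations, dates and similar tokens would need extra handling.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [part for part in parts if part]
    return [
        (f"{text_id}-s{n:04d}", sentence)
        for n, sentence in enumerate(sentences, start=1)
    ]
```

The returned identifiers can then serve as entry points to the information stored about each sentence occurrence.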

The text extracted from one document may also be divided into subtexts (interlinked subtexts that are not consolidated into a single text file). One possible determination condition for subtexts is sentence descriptors such as Chapter Title or Header. The device for text extraction will also preferably split larger texts into subtexts. For instance, it may be advantageous to split chapters in a report into separate text files.

Other object types in the document are ignored; however, these objects (tables, figures, video, etc.) are also equipped with a serial tag in order to make it easier to reconstruct the original document structure in the display procedures furnishing the text pane. It is noted that text is an interpretative medium that gives tables of data, figures, etc. knowledgeable ‘meaning’ in an organisational context. For this reason, it is important to also display the objects referred to in the text.

If the document is properly encoded and conforms to an XML Schema, the text extraction device will preferably be replaced by software delivered by XML-aware software firms. Alternatively, the present device will be adjusted to just equip each recognised DLOT with a serial tag applied in later reconstruction, and extract those DLOTs marked with XML tags recognised as signifying textual objects.

Device Sentence Extraction

This is a special-purpose Text Extraction Device encompassing all operations for the automatic extraction of pre-specified sorts of information based on the annotated sentences in a text. In order to distinguish this device from other extraction devices, the present invention uses the name ‘Sentence Extraction’, since sentences are the main Document Logical Object Type to be further processed by other interconnected devices.

Sentence Descriptor

Sentence Descriptor contains information about sentence types in text. The number of information items for each sentence varies.

The device for Document Structure Identification identifies whether or not a DLOT is a header or something corresponding to a header. If the document contains no mark-up, the text is simply extracted and sentences are used as the single unit for further processing. If titles and headers are enriched with grammatical tags, these may provide contacts (codes) to the category [cat4 fac0 Subject Matter Complex].

Device Text Transformation

The device for text transformation receives the collection of text objects (Text <is a> DLOT) extracted from documents.

These text objects may occur in various formats (multiformity). The device for text transformation converts all text objects into at least one common format. The present invention has chosen Plain Text Format (PTF) since this format is acceptable in most tools used for quantitative processing. In order to produce text statistics (text metrics) that are as reliable as possible, all files must exist in one common format, and the files must be checked and cleaned for all known ‘junk types’.

The text objects very often contain different types of ‘junk’, for instance ‘tag junk’, ‘typesetting junk’ and so on. Information about junk (trash) and routines used to clean the text objects are stored in a documentation system (IRMS). The file ‘Information Junk’ keeps a record of all junk types encountered, and the documentation system refers to procedures applied to deal with the specific junk types.

If the DLOT Text is extracted from files annotated in XML, the Text Transformation Device will preferably validate the files according to an XML Schema (well-formed) and store the result as a Text File XML.

A DLOT Text that is not annotated with XML will preferably be transformed to a simple XML version (depending on the format of the input document). In this case the present invention applies a simple XML Schema annotating sentences (assigning an Identifier Sentence). The Identifier Sentence in a Text File XML has to be matched against the Identifier Sentence Set in the versions Text File Plain (PTF).
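A minimal sketch of such a simple XML version, using Python's standard `xml.etree.ElementTree` module, is given here. The element names `text` and `s`, the `id` attribute and the function names are assumptions; the specification does not fix the schema at this level of detail.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

def to_minimal_xml(text_id: str, sentences: List[Tuple[str, str]]) -> str:
    """Wrap identified sentences in a minimal XML encoding.

    `sentences` is a list of (sentence identifier, sentence text) pairs.
    Special characters such as '&' are escaped by ElementTree.
    """
    root = ET.Element("text", {"id": text_id})
    for sentence_id, content in sentences:
        element = ET.SubElement(root, "s", {"id": sentence_id})
        element.text = content
    return ET.tostring(root, encoding="unicode")

def sentence_ids_match(xml_string: str, plain_ids: Iterable[str]) -> bool:
    """Check the XML version against the identifier set of the plain text version.

    ET.fromstring raises ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_string)
    return [s.get("id") for s in root.iter("s")] == list(plain_ids)
```

Keeping the same sentence identifiers in both versions is what allows results computed on the plain text file to be linked back to the XML file.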

The Identifier Sentence (aggregated into Text Spans for documents containing other object types) is input to a specially designed scheme for conserving and converting objects (DLOT) in the source document's logical structure (the XML Schema derived during the processing in the device for text extraction).

MAFS Multileveled Annotation File System

The different apparatuses include sets of procedures and devices that are tuned according to what object type they operate on and according to specific features (attribute types) attached to each object type. The procedures and devices operate on the various files stored in the MAFS and combine extracted and derived information with information delivered by other procedures. The combined or derived information is transmitted back into the files stored in MAFS.

MAFS is a file system, which conceptually may be considered as organised in interlinked levels or layers. The MAFS is managed by a DBMS of the prior art. In addition to the DBMS application that is constructed in the present invention, there is a superordinate layer denoted as an IRMS (Information Resource Management System), which includes and manages all sorts of documentation related to MAFS, apparatuses, user communities, etc.

MAFS allows for advanced filtering options such as those realised in grammar based request patterns, proximity searching, collocations, limitation according to the words' grammatical class or form, semantic query expansion, locating items within text zones (a derived documental logical object type), within zones with a certain density, and so on. Each filtering option is realised by combining simple instructions applied recursively, with the intermediate results stored in separate files in MAFS.
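The idea of building a filtering option from simple instructions can be sketched in Python with a proximity search restricted to one grammatical class. The token layout, the function names and the use of an in-memory dictionary in place of separate MAFS files for the intermediate results are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, str, str]  # (position in text, lemma, word class)

def filter_by_class(tokens: List[Token], word_class: str) -> List[Token]:
    """Simple instruction 1: keep tokens of one grammatical class."""
    return [t for t in tokens if t[2] == word_class]

def filter_by_lemma(tokens: List[Token], lemma: str) -> List[Token]:
    """Simple instruction 2: keep tokens with a given lemma."""
    return [t for t in tokens if t[1] == lemma]

def within_distance(hits_a: List[Token], hits_b: List[Token], max_distance: int):
    """Simple instruction 3: pair hits whose positions lie close together."""
    return [
        (a[0], b[0])
        for a in hits_a
        for b in hits_b
        if 0 < abs(a[0] - b[0]) <= max_distance
    ]

def proximity_search(tokens, lemma_a, lemma_b, word_class, max_distance) -> Dict[str, list]:
    """Combine the simple instructions, keeping every intermediate result."""
    steps: Dict[str, list] = {}
    steps["by_class"] = filter_by_class(tokens, word_class)
    steps["hits_a"] = filter_by_lemma(steps["by_class"], lemma_a)
    steps["hits_b"] = filter_by_lemma(steps["by_class"], lemma_b)
    steps["pairs"] = within_distance(steps["hits_a"], steps["hits_b"], max_distance)
    return steps
```

Because each step's output is retained, a later filter can start from any intermediate result rather than from the full token list.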

The storage and management of annotations organised in layers (abstraction levels) supports data independence and the possibility to define views similar to those in traditional database management systems (including a metadata layer representing the connections between file elements in the various layers).

The need for representing overlapping and discontinuous constituents indicates that a supplementary option is to store and manage these tags in separate external files. Efficiency requirements favour the embedded option, with the files residing in an XML-aware document management system. The proposed system will embody both options, with files stored and managed in layers (a multileveled annotated file system, abbreviated as MAFS).

The layers defined in the MAFS provide the application designer with a high degree of flexibility regarding the specification, tuning and optimisation of search selectivity and accordingly the construction of attention structures, including the content of the text sounding board. From the bottom layer, the designer (based on user community requirements) can extract a subset of the annotations and store these in an intermediate layer. (The bottom layer constitutes the file system with all tags except tags assigned to the same text spans or parts of text spans marking overlapping or discontinuous constituents, and tags marking hypertext anchors.)

A special purpose device (designer's tool) will provide facilities for the system designer interested in customising the tags and/or the attributes associated with each tag. The designer will be given options for selecting, accepting, ignoring, restricting and editing (for instance renaming) existing annotations within a working space (buffer) and for storing the final selections as an annotation perspective. The intermediate layer is in fact a stored set of files reflecting different perspectives on the underlying text. The system structure has some resemblance to the ‘view option’ or ‘sub-schema option’ in traditional database management systems.

When information about hypertextual links between text segments is stored and managed in external files, the link type is added to the search operand set, providing the retrieval of pairs of text segments (or bundles of text segments, depending on the link type's cardinality). Pragmatic-semantic link types used as search operands (for instance <problem has solution>, <more details in>, <agreement between>, <argues against>, etc.) will retrieve text segments reflecting deeper semantic relations than what is included in each of the text segments in isolation.

The intermediate and top layers will be dynamically generated (for each text base expansion or change in user requirements). New files with annotations may therefore be added on top of ‘older files’, so these layers will support future applications (types of information needs not yet foreseen). In the automatic mode, the texts are annotated (structure and grammar) without manual intervention. If the automatic tools result in ambiguities, these may be corrected by manual intervention by entering an annotation-editing mode (increasing corpus quality). The designer can choose to keep the previous tags or attribute values in one version and replace them with new tags and/or attribute values in a new version.
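The use of a link type as a search operand, as described for externally stored hypertextual links, can be sketched in Python. The `Link` record, the segment identifiers and the function name are hypothetical; only the link type names are taken from the text.

```python
from collections import namedtuple
from typing import Dict, List, Tuple

# One externally stored link between two text segments.
Link = namedtuple("Link", ["source", "link_type", "target"])

def retrieve_by_link_type(
    links: List[Link], segments: Dict[str, str], link_type: str
) -> List[Tuple[str, str]]:
    """Return (source text, target text) pairs connected by the given link type."""
    return [
        (segments[link.source], segments[link.target])
        for link in links
        if link.link_type == link_type
    ]
```

A query on <problem has solution> then returns segment pairs whose relation is not visible in either segment taken in isolation.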

MAFS Bottom Layer

The bottom layer is the set of files with all types of tags embedded, that is, annotations for structural information and also special tags for text span edges. The latter allow for explicit representation of word and sentence identifiers (used to represent text span edges). The bottom layer also contains the file header information.

MAFS Text File Plain (PTF)

Plain text format without annotations is the proper input format for various data processing programs, such as WordSmith, Document Explorer and ATLASti, various statistical programs, Part-of-Speech taggers, Constraint Grammar taggers, etc. It has recently been announced that some CG-taggers also accept texts annotated with a minimum XML tag set.

MAFS Acquisition Information

A tailor-made interface provides support for refining these predefined rules, adding new types of segmentation units, and performing easy manual interventions and corrections in the segmented files. The segmentation module generates Annotated Text Files (ATF) consolidated and stored in the Bottom Layer of the MAFS.

When tags are embedded (stored in the ATFs), the header (of the file) contains general information about the file. This record, kept for version control, includes a set of flags indicating whether or not the file has passed through dictionary lookup, part-of-speech tagging and CG-tagging, and information assigned after disambiguation, for instance elements in the Dublin Core metadata records.

This record of information is required in order to supervise the subsequent processing, for instance when converting from one format to another. Since the different types of software used during analysis have special format requirements, each document will exist in several versions. Version control is therefore of utmost importance and is part of the corpus documentation. High-quality procedures and consistent information about elements in the corpus are essential for measuring progress, avoiding data duplication, controlling input data quality for later processing in special-purpose software used when enriching texts with tags or analysing texts, and so on. The format problem may otherwise become a bottleneck in the corpus processing. A consistent format treatment will ensure that the same set of transformation scripts can be applied to parts of the collected source texts in one run. An XML record format will also serve the corpus documentation, and each element filled in will be assigned a signature.

Recent reports claim that there are two restrictions that make the use of embedded XML inappropriate for the encoding of syntactic information. The structuring rules for syntactic information restrict the description variety to one relation, the part-whole relation. These structures can represent a hierarchically arranged sequence of embedded segments, but are not capable of encoding syntactic relations or network structures. In this structure any higher order element (e.g. a sentence) must embrace a chain of continuous sub-elements (phrases or words). It is therefore claimed that discontinuous constituents cannot be represented. The present invention circumvents this problem by applying a specially designed structure embodied as MAFS layers consolidated into layered database partitions, in which information about the lowest level documental logical object type wraps a higher level type, as explained in the section ‘Apparatus Filtering’.

MAFS Intermediate Layer

The intermediate layer is a set of annotated files (equipped with grammar tags and structure tags), which are generated dynamically with subsets of XML annotations stored in external files (for instance tags for text span edges representing the source and target anchors in hypertext structures). Thus it will be possible to represent different hypertextual perspectives superimposed on the same underlying text base. Overlapping and discontinuous constituents will be managed via the intermediate layer.
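Why external storage helps with overlapping and discontinuous constituents can be shown with a minimal Python sketch. Each externally stored tag references word positions instead of enclosing the words, so one tag may cover several separate stretches and two tags may cross. The class `StandoffTag`, its fields and the helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass(frozen=True)
class StandoffTag:
    tag: str
    # Inclusive (first, last) word positions; more than one part means
    # the constituent is discontinuous.
    parts: Tuple[Tuple[int, int], ...]

    def positions(self) -> Set[int]:
        return {i for first, last in self.parts for i in range(first, last + 1)}

def covered_words(tag: StandoffTag, words: List[str]) -> List[str]:
    """Words referenced by a tag, in text order."""
    return [words[i] for i in sorted(tag.positions())]

def overlap(a: StandoffTag, b: StandoffTag) -> bool:
    """True when two tags share words but neither is nested inside the other."""
    pa, pb = a.positions(), b.positions()
    return bool(pa & pb) and not (pa <= pb or pb <= pa)
```

Crossing spans of this kind cannot be expressed as properly nested embedded XML elements, which is the restriction the external files work around.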

MAFS Text File Annotated (ATF)

The ideal of plain text format is perhaps an inheritance from the period when corpora were basically used for linguistic research. Features such as font, font size, tables and graphical images are not considered highly relevant for linguistic analysis and are therefore usually removed from the corpus text. However, with respect to information filtering applications, such information may have its own value. Font size may for instance signal that the author emphasises certain phrases (cue phrases), or signal important points made (lead functions), and so on. Section headlines normally signal content (if they are ‘true’ macro propositions summarising the text subsumed under them). Units that are elements in the documents' logical structure (as defined in a Document Type Definition, DTD) do not pose special problems if they are properly tagged (SGML/XML).

The present invention is based on a document collection with annotations: grammatical annotations and annotations describing the documents' logical structure. The present invention will use grammatical annotations provided by others, either by applying licensed constraint grammar taggers or paid services from ‘tagger’ companies. The CG-tagger from the Centre for Computing in the Humanities (Bergen, Norway) will be applied for Norwegian texts. Within the EU there are many taggers available under licence agreements. The grammatical tags from the various taggers are normalised into a common tag set and converted into XML format.

The present invention will preferably use a specially designed annotation scheme for the documents' logical structure. In a very simple annotation scheme, only sentence boundaries are marked. Annotated texts allow for easier automatic manipulation, and there are several annotation standard proposals (there is not yet a generally agreed standard for text annotation). We have decided to use the annotation framework denoted as the Text Encoding Initiative (TEI). TEI provides a set of guidelines on how a large number of annotation types can be encoded in electronic format and uses XML as document mark-up (annotation format). In 2001, TEI launched the concept of XML-schema, which will be adapted for structure specifications. TEI also attends to necessary rules for future conversions conditioned by technological changes. At present, XML is an independent exchange format that allows for maximum portability. It is expected that software producers in the near future will deliver XML-aware software.

MAFS Text File XML

Current standard practice is annotation based on SGML or XML. XML is a subset of SGML (Standard Generalized Markup Language, ISO 8879). XML is a data format for storing structured and semi-structured text intended for dissemination on a variety of media or hardware/software platforms. An XML document can be broken down (defined) into its hierarchical components and stored in, for example, a relational database. Current XML/SGML-aware document management systems on the market are usually built on top of an object-relational database. This is essentially an object layer ‘placed’ on top of an existing relational database product. XML may also be used as an exchange format for data residing in relational database systems. The XML tags are used as a data wrapper during transfer of the text (or other types of data) between systems.

XML (like SGML) is a meta-language and there is no pre-defined list of elements. Users may name and use elements of their own choice. In XML there is an optional mechanism (obligatory within SGML), the Document Type Definition (DTD), for specifying the elements allowed in a specific class of documents (the class of documents being specified in ISO 15255:1999). The document instances have to conform to this type definition; more specifically, each document (instance) can be validated against the DTD. A document in XML format is self-describing, and information about the document represented in the tag system can be extracted, manipulated and formatted to the requirements of various target software. XML documents can be displayed, queried, and manipulated by using XML tools.

MAFS Top Layer

The top layer consists of files optimised to specific needs within certain user communities. The layer can be restricted to a subset of the annotated document collection and/or to a limited set of structural, grammatical and semantic tags. If a user community prefers/approves certain grammar based filters and discards others (considers some filters as less useful), this layer can be optimised to user requirements.

Apparatus for Disambiguation

This apparatus embodies devices that perform various types of text disambiguation. FIG. 4 gives a schematic presentation of the ‘Apparatus Disambiguation’ in accordance with the present invention. Language resources such as corpora, thesauri, lexical databases, grammar parsers, etc. represent large-scale investments, and the disambiguation of text is therefore based on a reuse and integration of existing resources. The disambiguation apparatus deals with techniques for converting output from Constraint Grammar taggers (CG-taggers) into an annotation format in compliance with the structure/architecture specified for the Multileveled Annotation File System (MAFS). According to the invention, it is preferred to extract a subset of the grammatical tags delivered as output from CG-taggers. These extracted subsets are converted into tagged entries (both embedded and in external index files), each entry linked to the words or word combinations in the text.
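The conversion of tagger output into tagged entries can be sketched in Python. The input layout assumed here, a quoted word form in angle brackets followed by indented readings of the form "lemma" TAG TAG, is a common layout for constraint grammar output, but the actual tagger format, the tag names in `KEEP` and the XML element and attribute names are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

KEEP = {"N", "V", "ADJ", "ADV"}  # hypothetical subset of tags to extract

def parse_cg(output: str) -> List[Dict]:
    """Read word forms and the first reading of each from CG-style output."""
    words: List[Dict] = []
    for line in output.splitlines():
        if line.startswith('"<') and line.rstrip().endswith('>"'):
            words.append({"form": line.strip()[2:-2], "lemma": None, "tags": []})
        elif line[:1].isspace() and words and words[-1]["lemma"] is None:
            match = re.match(r'\s+"([^"]*)"\s*(.*)', line)
            if match:
                words[-1]["lemma"] = match.group(1)
                words[-1]["tags"] = match.group(2).split()
    return words

def to_annotated_xml(sentence_id: str, words: List[Dict], keep=KEEP) -> str:
    """Convert parsed words into XML entries carrying only the kept tags."""
    sentence = ET.Element("s", {"id": sentence_id})
    for position, word in enumerate(words, start=1):
        kept = [tag for tag in word["tags"] if tag in keep]
        element = ET.SubElement(sentence, "w", {
            "id": f"{sentence_id}.w{position}",
            "lemma": word["lemma"] or word["form"],
            "tags": " ".join(kept),
        })
        element.text = word["form"]
    return ET.tostring(sentence, encoding="unicode")
```

Only the first reading of each word is kept, which presumes that the tagger has already disambiguated; remaining ambiguities would be a case for the annotation-editing mode.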

The disambiguation process also covers approaches related to a device for Target Word Selection (TWS) in order to improve the strategy of text zonation (see Apparatus Zonation).

The preferred embodiment of the invention applies extracted subsets of grammatical tags (codes) combined with a selected set of semantic codes. The invention applies existing semantic resources encoded in Domain Specific Thesauri (DST) owned by the user community (or which the user community is licensed to use).

Information filtering applications involving tasks related to natural language processing require annotated texts. As mentioned above, the term annotation refers to the marking of information. With respect to disambiguation procedures this means special codes, describing different linguistic features, that are assigned to the words in the texts. The fundamental linguistic annotation is part-of-speech tagging (POS-tagging). This type of annotation is considered obligatory for information extraction and semantic disambiguation. Constraint Grammars provide for annotations at a higher level. A Word Sense Disambiguation (WSD) process is based on at least POS-tagging.

Multileveled annotations require a decision on the annotation schemes employed at each level, and how to convert the output from the various processing tools to formats consistent with the annotation schemes. This multileveled approach calls for external storage of annotations (linked to words, word combinations, phrases or other structural units (text segments) in the text files). Text zones may be marked with edge-tags (attribute-value pairs), each tag referencing lower level units (and with pointers to physical addresses).

Text units, defined as series of sentences, may contain words or word constellations that refer to lead functions, e.g. in special purpose sentences such as titles, headers, etc. Such features are treated in the apparatus for zonation. Text units may be classified according to the superordinate argumentative function, for instance description of a situation, utterances related to problems, evaluation of problems, problem comparisons, proposed solutions, selected solutions, evaluation of solutions, and so on. Such lead functions vary with text genre, and it is possible to conceptualise superstructures (systems of lead functions) for any genre. Lexical signals for such lead functions are identified and stored/maintained in a separate keyword file (cue phrases). The approach aims at partitioning the content of the text sounding board (i.e., index structure) in that grammar based codes can be filtered according to whether they are derived from text zones (or other types of pre-defined text segments with semantic-pragmatic codes referring to lead function). Consequently, the user may request a constrained display of the triple track (referring to a special purpose index structure denoted as APOS) occurring ‘within’ supposedly ‘more relevant’ text spans, for instance all zones encoded as dealing with problems related to the domain in question.

The quality of the system's selectivity is the main issue. Text enriched with grammatical and semantic codes (tags) will support better semantic applications, and improve performance of data exploration in texts.

The source text (plain text files) annotated (enriched) with grammatical tags is a prerequisite for constructing search macros with grammar based search operands. In accordance with the present invention, grammatical coding has its weakness in that it leads to ‘over-coding’. If the extraction procedure is not restricted to certain grammatical categories, each word in the text will be assigned a series of values referring to grammatical information (the word's grammatical class and syntactical function, and other types of morphological and syntactical information).

A set of transference rules influences the design of a new tool set to be used by the application designer during the extraction procedure. The extraction procedure is also influenced by a set of grammar patterns realised as building blocks in the search macros (components in the filter options).

Tags denoting the different grammatical word classes are utilised in the zonation apparatus. Nominal expressions may indicate certain types of propositional content, verbal expressions may indicate certain actions, and adjectives or adverbial phrases may indicate certain modes of achievement as well as the degree of strength related to the sincerity conditions. POS-tags form an important part of the input to the Word Sense Disambiguation (WSD) procedure. The results from a Target Word Selection (TWS) procedure may, for some user applications, be adequate in the construction and strengthening of the text zones.

Grammar based search operands combined in search macros (grammatical search patterns, or grammatical request patterns) will retrieve zones and sentences from the underlying texts. This is however not a sufficient filtering (sufficient according to some criteria framing the information need). The words coded as nouns, verbs, etc., will have to be further filtered and validated in order to assign discriminating descriptors being the constituent parts displayed in the text sounding board. For each validation, either by manual intervention and/or a Target Word Selection procedure (dictionary lookup), the application designer, or preferably the system, can assign one or several semantic codes to the words (or other textual units, preferably semantic codes at various abstraction levels). These semantic codes may preferably be assigned to zones and sentences (or other derived object types such as chains). Further, semantic codes at a lower abstraction level are associated to smaller textual units such as text zones as defined in the present invention.

Part-Of-Speech taggers are classifiers choosing the most likely tag for each word in a context (normally a sentence), and with reference to a given set of possible tags. Each word is assigned a tag (or annotation) indicating its morphological category (noun, verb, adjective, . . . ) and morphological features like number, gender, tense, and so on (singular, plural, base form, past tense, comparative, . . . ). POS taggers have reached a fairly satisfactory level of accuracy and the amount of such resources available on the WWW is steadily growing. Their availability is however highly dependent on the language.

Recent reports confirm that tagger performance to some extent is dependent on the text type (genre). It is reported that there is a lack of knowledge regarding performance changes when moving from the training domain (text genre) to other domains. The performance of taggers on a corpus may be uneven (since they represent different underlying theories and therefore have different tag sets with respect to coverage and size), and the taggers may also have been trained on different text genres. Information about Document Class and Text Genre will therefore influence the choice of grammar tagger if there are several competing taggers available. The surveillance of tagger performance in relation to text genre accordingly influences the following tag extraction procedure. That is, the rules of transference are adjusted to assembled performance data.

If the texts in a document collection are annotated using different taggers, there is no guarantee of consistency between the various annotated texts. When the goal is to make essential grammatical information available in a set of annotated files stored and managed in the intermediate layer of the MAFS, and further consolidated in DBPs, such differences can be minimised by constructing mapping schemes. The differences are systematised and conversion rules integrate the tag sets from the various taggers into one consolidated tag set. The techniques applied for schema mapping are widely known in the prior art.

In the present invention the procedure for integration and consolidation of various tag sets maps the correspondences into a ‘standard denotation scheme’ or ‘tag nomenclature’. The set contained in the Tag Nomenclature will be a reduced tag set as compared to the various types of grammatical information delivered from the various taggers applied. The criteria for reduction reflect decisions made about what types of tags should be taken into account in the construction of search macros with satisfactory discrimination ability. The Tag Nomenclature will be expanded for every ‘new’ tagger used in disambiguation procedures. The expansion is based on data assembled through series of investigation steps, the most important being the conjunction and disjunction of tag types. (For instance, is it satisfactory to define one tag covering nouns singular as well as nouns plural, is it necessary to keep all verb tenses as separate tags or will base form, present and past tenses suffice?) Another important step is to see whether there is a need for adjustments according to different text genres. When these investigations are made and the Tag Nomenclature updated accordingly, the integration and consolidation procedure is to some extent similar to schema integration procedures in traditional database systems.
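By way of a non-limiting illustration, the consolidation of tagger-specific tag sets into a reduced Tag Nomenclature may be sketched as a set of per-tagger mapping tables. The tagger names, the native tags of the second tagger and the consolidated tag names below are assumptions made for the example only; the sketch does not prescribe an actual tag set.

```python
# Hypothetical mapping tables: one per tagger, from native tag to consolidated tag.
# All tagger names and tag names are illustrative assumptions.
TAG_NOMENCLATURE = {
    "tagger_a": {
        "NN": "NOUN", "NNS": "NOUN",  # singular and plural nouns share one tag
        "VB": "VERB_BASE", "VBP": "VERB_PRES", "VBD": "VERB_PAST",
    },
    "tagger_b": {
        "subst": "NOUN",
        "verb_inf": "VERB_BASE", "verb_pres": "VERB_PRES", "verb_pret": "VERB_PAST",
    },
}


def consolidate(tagger, tagged_words):
    """Map (word, native_tag) pairs onto the consolidated tag set.

    Tags without an entry in the nomenclature are dropped, which models
    the reduction of the tag set.
    """
    mapping = TAG_NOMENCLATURE[tagger]
    return [(word, mapping[tag]) for word, tag in tagged_words if tag in mapping]
```

In this sketch, adding a ‘new’ tagger amounts to adding one more mapping table, while the detailed native tags remain available in the original annotated files.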

However, since each tagging procedure produces an ‘ATF with Grammar Annotations’ it is possible to store and manage the original and detailed annotations in a separate file system ((ATF <part of> Bottom Layer) <part of> MAFS). The original set of grammatically annotated texts is stored and managed in the ‘Bottom Layer’ of MAFS. Detailed annotations in the ‘Bottom Layer’ support experimentation aimed at finding the ‘best’ tag set for information filtering for each user community requesting the services.

Grammatical Parsing

The present invention identifies these new, seldom continuous but often overlapping text zones by processing grammatically encoded text. Grammar taggers known in the prior art produce the grammatical information encoded in the set of files stored in a database partition denoted as the Bottom Layer of a Multi-levelled Annotation File System (MAFS). The present invention includes an apparatus for pre-selecting the types of grammatical information to be included in the Bottom Layer, and also includes an apparatus for manual intervention, in that most grammar taggers still fail to disambiguate texts with a 100% correctness score. For certain application domains, as for instance in medical journals, it is of utmost importance to have an option for manual intervention.

Disambiguation errors are often caused by spelling errors. Misspelled words are not recognised during lexicon look-up, misspelled words disturb the zonation procedure and disturb the frequency and distribution data, and misspelled words are out of reach for users transmitting their queries as free-text queries. The present invention assumes a programmed connection between the apparatus for validating grammar tags and a spelling-corrector known in the prior art.

The grammatically encoded texts stored in the Bottom Layer and managed by a customised DBMS known in the prior art are further processed by the present invention and transformed into a customised XML-format. The XML-formatted files are organised as an interlinked set, each file containing data about different Documental Logical Object Types. Meta-data about documents from which the texts are extracted are stored and managed according to the rules prescribed for the Dublin Core Element Set (DCES), known in the prior art. The Dublin Core Element Set is expanded by a special purpose set of attribute types. A device that calculates keyness and the ‘keyness of keyness values’ transmits data to some of these new attribute types. The keyness values are preferably restricted to encompass words annotated as being in the grammatical classes of nouns and verbs. The device for keyness calculation can be tuned towards any kind of text segment (portions of one text). Part-of-speech tagging is used for lexical ambiguity resolution.

A higher level of grammatical annotation is syntactic mark-up, where full or partial parsing trees are marked for each proposition. This level of annotation is rapidly developing. A constraint grammar recognises word-level ambiguities. For example, in a phrase like ‘the claim’, the word claim is marked as a noun since a determiner is never followed by a verb.
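The ‘the claim’ example may be sketched as a single constraint applied to words that carry several candidate tags. The function name and the tag labels DET, NOUN and VERB are assumptions for the illustration; an actual constraint grammar applies a large rule set and is not limited to this one constraint.

```python
def apply_determiner_constraint(readings):
    """Remove the VERB reading of a word that directly follows a determiner.

    readings: list of (word, candidate_tags) pairs in sentence order.
    Only this one constraint is modelled; tag labels are illustrative.
    """
    resolved = []
    previous_tags = set()
    for word, tags in readings:
        tags = set(tags)
        # A determiner is never followed by a verb, so discard that reading,
        # but never remove the last remaining reading of a word.
        if previous_tags == {"DET"} and "VERB" in tags and len(tags) > 1:
            tags.discard("VERB")
        resolved.append((word, tags))
        previous_tags = tags
    return resolved
```

Applied to the readings of ‘the claim’, the sketch leaves only the noun reading for ‘claim’, while ‘claim’ after a pronoun remains ambiguous for other constraints to resolve.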

The study of verbs is complex due to the lists of arguments the verb takes and the types of nouns or noun phrases in the argument position together forming a verb phrase. The verb ambiguity is related to the differences in the nouns that co-occur in the sentential structure around the same verb. The classification of verbs into transitive, intransitive, and transitive/intransitive is one part of the disambiguation process. According to the grammatical characteristics of verbs, the list of argument nouns is added. Thus the co-occurrence of verbs and nouns is of interest, as is the position of the main verb and auxiliary verbs relative to the main verb and the nouns' positions. Adverbial particles also play an important role in the semantic disambiguation of verbs (at least this holds for the Scandinavian languages, but these patterns differ from one language to another).

The semantic relations of the co-occurrence of verbs and nouns may be used to resolve some types of ambiguity. The construction of search macros is thus dependent on the delivery from constraint grammar taggers, and for each specific user community (typical tasks, typical information needs, etc) the types of grammatical information delivered are carefully considered in order to design a conceptual framework for filter options.

The present invention embodies a particular device that generates triple tracks displayed and operable in the text sounding board. The basic grammatical structure underlying the triple track is the constellation Subject Verb Object Structures (SVOS). These are abstracted into a similar triplet with facets for Agent, Process and Object (APOS) with associations to the occurrence sets for each of them (occurrences with associations to the SVOS being associated to the APOS). Information about occurrence sets must be recorded for each text (that is, text extracted from a document being a member of a document class) and for each SVOS selection. These records form the basis for comparisons in order to find the frequency scores of the triplets and their components. Systematic comparisons may uncover triplets or facets with a high discrimination ability, or at least form a ground for selecting essential triplets or facets. The selection of essential triplets/facets will rest in particular on the zone link sets generated by the Zonation Apparatus.

Irrespective of software used during the analytical tasks, the present invention presupposes, to a limited extent, access to resources in which words with similar meanings are grouped together. Linguistic research communities have produced valuable sources of linguistic information; some of the results are made available as freeware, and for others it is possible to acquire special licenses for further use in new applications. These resources include domain-specific thesauri (thematic thesauri) and more lexicographic thesauri. The important point to be made is that such thesauri represent existing knowledge and will be re-used if the producers or the copyright acts allow it. A device that applies thesauri is described in the section ‘Device Target Word Selection’.

Device Target Word Selection

Considering Word Sense Disambiguation (WSD), the procedure is dependent on what is the ‘unit of meaning’; see section ‘The principle of text driven attention structures’. If the WSD is based on output from POS-taggers, the units are words, and a WSD by simple dictionary lookups will not be reliable (polysemy, several concept matches for each word, etc). A word as an isolated unit has no semantic discrimination ability; in order to make a reliable WSD the word must be classified with reference to the textual context in which it appears. The WSD procedure must therefore be validated in a particular device designed for computer-supported manual intervention.

The present invention is based on the assumption that it is possible to identify a certain satisfactory level of concept abstraction under the restrictions described in the section ‘Zonation Criteria’. Concept abstraction is the procedure that selects a certain set of concepts in a concept hierarchy (thesaurus) and traces the set to one or more upper level concepts (abstract concepts) or lower level concepts (detailed concepts). The starter set of concepts corresponds to words already identified in the text being processed. So instead of using the term WSD, a more appropriate term in the present invention is Target Word Selection (TWS), that is, supplementing identified index entries by selecting words/concepts from a certain abstraction level in existing lexical databases, that is, Domain Specific Thesauri (DST). Recall that the target word selection procedure is applied in order to strengthen text zones, as perhaps opposed to the general idea that a particular semantic network can be ‘superimposed’ on any texts.

Concept abstraction is commonly considered as a mechanical operation that simplifies a concept hierarchy. Concerning index entries, the present invention organises these into triplets in which words tagged as components in SVOS (Subject Verb Object Structures) are extracted from sentences, and thereafter further abstracted into triplets in the form APOS (Agent Process Object Structures). From empirical (small-scale) investigations it seems clear that the form of concept abstraction performed is a promising approach for data reduction. Data reduction is necessary in order to reduce the set of words displayed in the text sounding board partitions. This does not mean that the set of words is eliminated or is not available, but simply that the number of words displayed can be regulated so as to show portions at a time, preferably organised along the dimension from general to specific. The derived semantic relations to other words (also occurring in the text) will be registered and consequently may be displayed if the user selects an option ‘display details’ for a currently selected word. The original word (being a component in SVOS) may preferably be linked to upper level concepts (a component in APOS and mainly through IS_A relations). It is therefore feasible to display the associations (occurrences of a certain association type) in simply structured semantic nets (local to documents, zones or sentences since each word or index entry also implicitly includes references to such units). These semantic relations between words may cover each text individually (intratextual semantic relations) or be consolidated to cover several texts that are extracted from documents that share some features in the situational context description. The semantic structures are described according to the syntax defined for XML.

Since the approach is founded on the principle of text drivenness, the present invention avoids some of the known problems commonly denoted as the ‘consistency problem of semantic indexing’. Due to the maintenance workload related to classification structures, it is desirable to minimise the concepts and keep the structure as clear as possible. This approach is subsumed under terms such as minimalism and coherence. Minimalism must be balanced against requirements such as semantic discrimination ability, which in turn must be considered with reference to the application purposes.

Concept abstractions generalise concept descriptions and are obtained incrementally from texts. Concept abstraction via dictionary lookup may contribute to an exact and compact concept description development. However, excessive abstraction may lower the system's discriminating ability (wrong ‘information’ represented).

These mentions of advantages are thus similar to well-known principles within classification theory, and the techniques are in fact a set of classificatory data reduction rules (macro rules) aiming at simplifying concept hierarchies. Human introspection is needed in order to evaluate/validate the computer-assisted operations, including comparisons between variants of the rules applied. This procedure is supported by a device for construction of domain specific thesauri (in this case, thesauri covering words and concepts as related to texts held in a document collection of interest to a user community).

If the concept structure in a thesaurus allows for multiple inheritance, the abstraction procedure can abstract to either one of the superordinate concepts, or to both of them. Reports on this subject matter discuss the problem that the first alternative may cause an abstraction in the ‘wrong’ direction, while the latter may cause redundant semantic ambiguity (produces 1:m correspondences). In the present invention, however, a concept from a thesaurus will not be captured if the corresponding word does not exist in the text, i.e., in the neighbourhood of the target word in the first TWS cycle. Since each word by its identifier is connected to the underlying texts, it will be possible to restrict the coverage area. It is assumed that it will be rather seldom that the very same word refers to the ‘same meaning’ within a short text span within one text. The set of constraints can be loosened as more related texts are processed.

Domain Specific Thesauri

A domain specific thesaurus is small or medium-sized, purporting to explain the meaning(s) of a word via a concise definition with reference to a domain of interest. Each entry is commonly connected to other entries as broader terms or narrower terms.

Lists of candidate terms of the domain can be extracted from linguistically processed text corpora. A term is a word that may be associated with a domain specific concept and usually takes the form of a nominal expression. The identification and coding must take into account that the same word (or word constellation) may have different grammatical functions in the texts. TWS is applied iteratively by mapping the concepts in domain specific thesauri (with morphological variants) against words extracted into the SVOS (Subject Verb Object Structures). The concepts in the domain specific thesauri returning with the value ‘no match’ are then input to a TWS procedure between these ‘no-match’ concepts and concepts encoded in a more general thesaurus (Lexicon). This mapping procedure seeks synonyms and/or abstracted concepts, and these ‘replacements’ are then mapped against the SVOS in a second round. The concept abstraction is restricted to certain subsets of the concepts organised in the general thesaurus, for instance by restricting the search to certain abstraction levels (up). The final decisions about such restrictions will be based on feedback from user communities requesting the filtering mechanisms underlying the exploratory search options.
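The iterative mapping described above may be sketched in two cycles as follows. The function name, the dictionary layouts and the sample vocabulary are hypothetical; in particular, the general thesaurus is reduced here to a plain lookup from a concept to its proposed synonyms or abstracted concepts, which is a simplifying assumption.

```python
def tws_cycles(dst_concepts, svos_words, general_thesaurus):
    """Two-cycle Target Word Selection sketch.

    dst_concepts: domain concept -> list of morphological variants.
    svos_words: words extracted into the SVOS.
    general_thesaurus: concept -> list of synonyms/abstracted concepts.
    Returns (cycle 1 matches, cycle 2 matches, concepts still unmatched).
    """
    svos_words = set(svos_words)

    # Cycle 1: map domain concepts (with variants) against the SVOS words.
    first_round, no_match = {}, []
    for concept, variants in dst_concepts.items():
        hits = svos_words & set(variants)
        if hits:
            first_round[concept] = sorted(hits)
        else:
            no_match.append(concept)

    # Cycle 2: look up 'no match' concepts in the general thesaurus and map
    # the proposed replacements against the SVOS words.
    second_round, unmatched = {}, []
    for concept in no_match:
        hits = svos_words & set(general_thesaurus.get(concept, []))
        if hits:
            second_round[concept] = sorted(hits)
        else:
            unmatched.append(concept)
    return first_round, second_round, unmatched
```

Restricting the second cycle to certain abstraction levels would, in this sketch, correspond to limiting which entries the general thesaurus lookup is allowed to return.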

TWS based on domain-specific thesauri (for instance the Petroleum Affair Base or others) may lead to a deeper semantic classification based on the identification of how a specific word co-occurs with other words held in preferably a set of zone link sets from which the SVOS are derived. The occurrence of two or more words within a well-defined unit (i.e., sentence) is called a co-occurrence. Co-occurrences can be statistically processed by particular tools computing collocation patterns, based on different types of measures. The consolidated set of SVOS will thus not reveal collocations in this sense. The SVOS extracts will however at least reflect how words co-occur within sentences together with information about the words' grammatical functions. The present invention will preferably realise a programmed connection to software that can produce these combined collocations (grammatical information combined with frequency and distribution information). The combined collocations are stored in the database partition containing data about word occurrences. General XML tools combined with tools like Document Explorer can produce the frequency information required in certain filter options. The same set of software tools can also be applied when generating proximity information, which is essential in several grammar-based search macros (Filter Module). For instance, one filter option presupposes the activation of search macros that identify common nouns tagged as object in one sentence occurring as subject in adjacent sentences or sentences within the same zone (distance operator). The search macro realises an algorithm for computing an adjacency factor (sentence distance between common noun as object and the same common noun as subject) and uses proximity measures in weighting procedures. Proximity measures are used as input in filtering options (described below).
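A minimal sketch of the adjacency factor follows. Each sentence is represented by a dictionary holding one subject noun and one object noun, and the weight is taken as the reciprocal of the sentence distance; the record layout, the parameter max_distance and the weighting formula are assumptions for the illustration and are not prescribed by the description.

```python
def adjacency_factors(sentences, max_distance=2):
    """Find common nouns tagged as object that recur as subject nearby.

    sentences: list of dicts with 'subject' and 'object' keys, in text order.
    Returns tuples (noun, object_sentence, subject_sentence, weight), where
    weight = 1 / sentence distance (an illustrative proximity measure).
    """
    results = []
    for i, sentence in enumerate(sentences):
        noun = sentence.get("object")
        if noun is None:
            continue
        last = min(i + max_distance, len(sentences) - 1)
        for j in range(i + 1, last + 1):
            if sentences[j].get("subject") == noun:
                results.append((noun, i, j, 1.0 / (j - i)))
                break  # keep only the nearest recurrence
    return results
```

Restricting the search to sentences within the same zone could be modelled by passing only the sentences of one zone to the function.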

General Thesauri

A general thesaurus is medium-sized or large, purporting to explain the meaning(s) of a word via a general description. These general descriptions, depending on the word's semantic type, may include information that classifies the word into a group of similar words, information describing the properties, information about parts, information about the origin, information about functions, and so on. These relations subsume many other relation taxonomies. The procedure for Target Word Selection being a constituent part of the Thesaurus Expansion will take advantage of this taxonomy. This type of expansion procedure is however restricted to words registered in the text's keyword set, i.e., the words with a certain keyness value assigned to a new element type added to the document's Dublin Core Element Set. The abstracted or specialised term from the thesaurus is included and registered if it is contained in the text that the keywords refer to, and where the word occurs with a frequency above a certain threshold value. The generated code-to-code-links between words meeting the keyness threshold values are not considered to be a part of the thesaurus structure; they simply reflect an expansion of the set of keywords. There are specific reasons for the restrictions imposed on the use of general thesauri, see section ‘The principle of text driven attention structures’.
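The restricted expansion may be sketched as follows. The function and parameter names are hypothetical, the thesaurus is simplified to a lookup from a keyword to its abstracted or specialised terms, and the text is assumed to be available as a list of tokens (lemmas).

```python
from collections import Counter


def expand_keywords(keywords, thesaurus, text_tokens, threshold=1):
    """Link keywords to thesaurus terms that also occur in the text.

    A term is registered only if its frequency in the text is above the
    threshold. Returns (keyword, term) links that expand the keyword set.
    """
    counts = Counter(text_tokens)
    links = []
    for keyword in keywords:
        for term in thesaurus.get(keyword, []):
            if counts[term] > threshold:
                links.append((keyword, term))
    return links
```

The returned links are kept apart from the thesaurus itself, in line with the statement that they reflect an expansion of the keyword set only.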

Triple Track

This section outlines the procedure for the construction of interlinked Triple Tracks made available in one of the modus operandi defined for the text sounding board. In this description the focus is on two basic components: the SVOS and APOS. These components are formalised as a triplet <concept, association type, context>. This triplet formula holds for all textual/contextual levels. For example: document <within> document collection, text <within> document, zone <within> text, sentence <within> text, word <within> sentence, word <is a> subject, word <is a> verb, subject <precedes> verb, etc.

-   The grammatical annotations delivered from a constraint grammar.
-   Word level or higher order unit types in the text, specifically derived zones.
-   The Subject-Verb-Object Structures (SVOS) extracted from the annotated text or a selected set of sentences, preferably within zones.
-   The validated SVOS denoted as Agent-Process-Object Structures (APOS).
-   Zones with assigned APOS (zones are series of sentences with defined connection points stored in the zone link set).

The distinction between SVOS and APOS follows the traditional division of syntactic and semantic types. The SVOS are directly associated to the sentences they are extracted from (associations represented by edge-elements in XML-files and word/sentence identifiers). The APOS are a subset of the SVOS, and the subset is selected according to a reduction strategy (semantic and pragmatic criteria). Each APOS is a set of index entries that ‘inherit’ the edge-elements from the SVOS [APOS <is derived from> SVOS]. The APOS are thus associated to the underlying sentences (or text zones, i.e., sentences <is part of> larger textual units). The index entries in APOS are therefore denoted as ‘textual contacts’ or simply ‘contacts’ in order to distinguish them from concepts used in the presentation of ordinary index structures.

The realisation of the association types makes it possible to construct an index system in which the concepts are not only organised in hierarchies, but also in a kind of ‘heterarchies’ (top-down and also side-by-side, that is, hypotactic and paratactic relations). This particular kind of structure is elaborated in detail in Aarskog (1999). The visualisation in an interface will take the form of windows arranged side-by-side, each window with options for expansion/reduction (more general or more specific terms), and options for displaying the underlying words as they appear in the text. A preferred embodiment is shown in FIG. 5. The triple set of panes is denoted a ‘triple track’. The figure depicts a prototypical embodiment of the present invention.

The final architecture (interlinked file system, interface, etc.) is to be implemented in a more robust technological platform (Unix, Java, Lisp, XML/XSL). The underlying data structures in the windowpanes outlined in FIG. 5 are generated from a CG-tagger for Norwegian, which is cost-free software for research communities and subject to normal licence agreements for commercial organisations. As the separate panes (APO) show, the underlying data structures are consolidated in this prototypical version. This means that if there are several word occurrences referring to the same token (word type), the panes will only display the word type. The present prototype embodies the basic functionality operating on the system selectivity presented in this document. However, the future use of the mentioned technological platform will include state of the art principles adapted from the field of Human Computer Interaction (HCI).

Triplet Formula

As indicated above, the SVO Triplets and APO Triplets are concepts organised in structures at different abstraction levels. The triplet formula is [concept <association type> context]. This is shown in FIG. 6.
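As an illustration only, the triplet formula may be represented as a simple record, with one helper that follows <within> associations from a unit to its enclosing units. The class name, field names, identifiers such as w1 and s1, and the assumption that each unit has at most one <within> context and that the associations are acyclic are all hypothetical choices for the sketch.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One instance of [concept <association type> context]."""
    concept: str
    association: str
    context: str


def enclosing_units(triplets, unit):
    """Follow <within> associations upwards from a unit.

    Assumes at most one <within> context per unit and no cycles.
    """
    parents = {t.concept: t.context for t in triplets if t.association == "within"}
    chain = []
    while unit in parents:
        unit = parents[unit]
        chain.append(unit)
    return chain


# Example data with hypothetical identifiers.
EXAMPLE = [
    Triplet("w1", "within", "s1"),
    Triplet("s1", "within", "z1"),
    Triplet("z1", "within", "t1"),
    Triplet("w1", "is a", "subject"),
]
```

Because every level uses the same record shape, the same helper serves for words, sentences, zones and texts alike.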

The APOS are derived from SVOS and are a result of reduction procedures involving: grammar based extraction patterns, principles within free faceted classification theory (including concept abstraction and the application of macro rules), and Target Word Selection procedures based on Domain Specific Thesauri.

The ground level is composed of Subject Verb Object Structures derived from grammatically annotated sentences, and the sentence grammar is the unit for extraction procedures realised through sets of regular expressions combined in search macros. What a sentence is about is not necessarily what its grammatical subject states; however, any formalism underlying the representation of information involves simplification and reduction. Even if the grammar based extraction patterns do not capture lexical units from all the texts' sentences, this does not mean that the patterns cannot produce a good information representation for exploratory purposes. Nominal expressions can be said to denote the texts' ‘world-building’ elements and the verb phrases what is said about them. This is also implicit in the free faceted classification theory in that the documents' theme can be inferred from the nominal expressions in the individual sentences. By using an evolving domain-specific thesaurus in the extraction procedures, it will be possible to tailor the APOS to support specific user communities. The APOS refer either to one text or a group of related texts, but can also be constrained to only display SVOS or APOS referring to text zones with assigned discourse element indicators.

The SVO and APO triplets are an important component in the system's selectivity, i.e. content representations. The panes or tracks in the triple track are an important attention structure in that they reflect some of the words' nearest inner context. When the user selects a word type displayed in one of the tracks, the other two tracks are immediately adjusted to include only those word types that co-occur with the word type activated. Similarly, if the user activates word types in two of the tracks, the third track will instantly display the word types that co-occur with the word types in the two other tracks. By ‘double enter’, the text pane will show the word types highlighted and, if the user selects this option, constrained to zones in which the word types occur.
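The adjustment of the tracks on selection may be sketched as a filter over the set of triplets. The function name, the tuple layout (agent, process, object) and the sample word types are assumptions for the illustration; the sketch does not cover highlighting in the text pane or the restriction to zones.

```python
def filter_tracks(triplets, agent=None, process=None, obj=None):
    """Return the word types to display in the three tracks.

    triplets: iterable of (agent, process, object) word types.
    Any facet given as an argument is treated as activated by the user;
    the tracks then show only word types that co-occur with it.
    """
    selected = [
        t for t in triplets
        if (agent is None or t[0] == agent)
        and (process is None or t[1] == process)
        and (obj is None or t[2] == obj)
    ]
    return (
        sorted({t[0] for t in selected}),
        sorted({t[1] for t in selected}),
        sorted({t[2] for t in selected}),
    )
```

Activating word types in two tracks corresponds to passing two arguments, after which the third list holds the co-occurring word types.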

Target Word Lists

Target Word Selection and Domain Specific Thesauri: The semantic nets encoded in thesauri can be searched for concepts and semantic concept relations defined as relevant to a domain.

In the proposed approach, according to a preferred embodiment of the invention, nouns (or nominal expressions) are represented in the S and O components being <part of> the SVOS. The SVOS are extracted from a selected set of sentences annotated with grammatical information, preferably sentences located within zones as part of a reduction strategy. A dictionary lookup makes it possible to investigate whether these nominal expressions exist in already established semantic nets. A mechanical dictionary lookup will however not tell whether the noun's sense in the text is similar to the sense given to the same noun in the semantic nets. Therefore the set of nouns (tagged as S or O) is also examined with respect to how they co-occur within one or several text zones. Identified co-occurrences (collocations) are compared to the concept relations encoded in the domain specific thesauri applied (consolidated collection of domain concepts).

These target word selection procedures would preferably benefit from pre-processed, encoded information about the domain involving: texts with grammatical annotations and concept hierarchies available either in Domain Specific Thesauri or as on-line lexical resources.

General thesauri contain concepts relevant to all sorts of domains and often also include indirect relations between concepts. This of course influences the semantic precision when mapping words from a text (grammatically annotated words), either single words or words grouped into SVOS, against concepts encoded in a general thesaurus. As an example, a concept in WordNet is an element in a synset (synonym set) and each element may have hyperonyms and hyponyms (except elements in the genus position, the concept having one or several subordinate concepts). In the present TWS approach, concepts encoded in Domain Specific Thesauri are mapped against the words in the SVOS extracted from the files annotated by a Constraint Grammar. That is, the concepts in the thesauri are the source concepts and the words represented in the S and O components are the target for the mapping procedure.

The TWS will return the values Concept Match or No Match. The present invention is based on the assertion that the direction of the mapping procedure has important practical implications. First of all, it is easier to supervise and manage the mapping results if the direction is from domain specific thesauri, or a domain specific word list, towards the SVOS extracted from grammatically annotated sentences. Secondly, a general dictionary lookup returns too many synonym proposals and abstractions, and the validation procedure accordingly becomes time-consuming. An outline of this concept is given in FIG. 7, and in Table 2.

TABLE 2 1 Construction of Target Word Lists For every concept in theselected Domain Specific Thesaurus or Domain Specific Word List (e.g.list of actors, organisations, items, substances, etc.), constructTarget Word Lists to be used in iterative TWS cycles. Focus is onnominal expressions: Type: Target Word List (TWL) <output from process>Process: Target Word Selection (TWS) Type: Target Word <is member of>Type: TWL 1 Validated <is a> Type: TWL 2 Validated <is a> Type: TWL 3Validated <is a> The target words are applied on the noun phrases beingpart of the SVO Triplets. Type: Target Word <is applied on> Type: SVOEntry Noun <is member of> Type: Target Word List (TWL) <is applied on>Type: Thesaurus General (GT) Type: Target Word DST <is a> Type: TWLAbstracted Target Word DST <is a> Type: TWL Synonyms Target Word DST <isa> Type: Value Concept Match <is assigned to> Type: Value No Match <isassigned to> The noun entries registered in the SVO Triplets are asubset of words classified by the CG-tagger as being member of thegrammatical word class ‘noun’. The GWC Noun also includes nominalexpressions, derived from applying regular expressions on the taggeroutput file. Type: GWC Noun <is input to> Type: Filter Noun <is a> Type:Grammatical Word Class (GWC) <is part of> Type: GWC Nominal ExpressionType: GWC Noun Common <is a> Type: GWC Noun Proper <is a> Type: SVOEntry Noun <is subset of> Type: SWC Noun <refers to> 2 Construct TargetWord list to be used in TWS cycle 1 For all single word concepts,construct a list of all morphological variations. These are the targetwords to be searched for in general thesaurus (GT) in order to constructlists of synonyms and abstracted concepts. 
Type: TWL 1 Validated <isinput to> Process: TWS cycle 1 <is a> Type: Target Word List (TWL) <isderived from> Type: Thesaurus Domain Specific (DST) Type: Target WordDST <is member of> Type: Target Word DST <is a> Type: Target Word Sincethe texts have grammatical annotations, it is possible to use word lemma(i.e., lemma not being the same as word stems), considered as animportant data reduction technique. In case of a match, the word isassigned a domain code. The term ‘domain code’ refers to a conceptencoded in the DST, and these assignments are temporal in each TWScycle. Type: Value Concept Match <is assigned to> Type: Target WordType: Domain Code <is a property of> Type:Target Word 3 Construct TargetWord list to be used in TWS cycle 2 For each DST concept, construct alist of synonyms, output from mapping DST concepts onto concepts in aGT. Type: Target Word <is applied on> Type: Thesaurus General (GT) Forall synonyms, construct a list of all morphological variations. Type:Synonym GT <output from process> Process: TWS cycle 1 <is derived from>Type: Thesaurus General (GT) <transformed into> Type: TWL Synonym TargetWord DST 3.1 Record overlapping DST entry concepts and synonyms.Separate synonymous concepts that are also main entry concepts in theDomain Specific Thesaurus. For instance, if [regulation] is a proposedsynonym to the DST entry concept [law], and regulation is also an entryconcept, put [regulation] into a separate list marking synonymousconcepts that overlap entry concepts. Exclude these from the TWS listfor cycle 2 (they are part of the word list in cycle 1). 3.2 Recordoverlapping synonym sets Each DST entry concept has its own set ofsynonyms. Mark overlapping synonyms with respect to entry concepts.These are to be coded as overlapping when the TWS lists are mappedtowards SVOS in cycle 2. 3.3 Consistency analysis Perform automaticconsistency analysis. 
Each entry concept with synonyms are assigned toseparate files, overlaps are marked by the value 1 in for example aconsistency analysis performed by WordSmith.. 3.4 Validate proposedsynonym lists Evaluate proposed set of synonyms against SVO Triplets.Apply background knowledge and investigate whether the synonym sensematches the word sense in text files. Record observations in separatefilter list (List for later exclusions of proposed synonyms). Type: TWL2 Validated <is input to> Process: TWS cycle 2 <is a> Type: Target WordList (TWL) Type: TWL Synonyms Target Word DST <is member of> 4 ConstructTarget Word list to be used in TWS cycle 3 For each DST concept, findthe nearest abstracted concept (hyperonym) in the GT (one abstractionlevel at the time). Construct a list of morphological variations. 4.1Record overlapping DST entry concepts and abstracted concepts Separateabstracted concepts that are also main entry concepts in the DomainSpecific Thesaurus. For instance, if [law] is a proposed abstractedconcept to the DST entry concept [regulation], and [law] is also anentry concept, put [law] into a separate list marking abstractedconcepts that overlap with entry concepts. Exclude these from the TWSlist for cycle 3 (they are part of the word list in cycle 1). Type:Concept Abstracted GT <output from process> Process: TWS cycle 1 <isderived from> Type: Thesaurus General (GT) <transformed into> Type: TWLAbstracted Target Word DST 4.2 Record overlapping abstracted conceptsand synonyms for DST entry concepts Construct list of synonyms for theabstracted concepts and perform consistency analysis against thevalidated synonym lists for cycle 2. 
Type: TWL 3 Validated <is input to> Process: TWS cycle 3 <is a> Type: Target Word List (TWL) Type: TWL Abstracted Target Word DST <is member of> 5 Register all possible target word constellations (consolidated Target Word Lists) For each DST concept, make a register with all variations for the set <DST entry concept, synonymous concept, abstracted concept>, that is target words being member of: Type: Target Word List (TWL) <output from process> Process: Target Word Selection (TWS) <is derived from> Type: Thesaurus Domain Specific (DST) Type: Target Word <is member of> Type: TWL 1 Validated <is a> Type: TWL 2 Validated <is a> Type: TWL 3 Validated <is a> Each validated domain code has assigned a TWS Cycle Code (kept in a separate file). Information from validated target word selection procedures is input to the thesaurus construction (or expansion) process. Type: TWS Cycle Code <is assigned to> Type: Domain Code For instance {<law - twscode=1><regulation - twscode=2><rule - twscode=3>} {<regulation - twscode=1><law - twscode=2><rule - twscode=3>}

These steps outline the TWS procedure. Instead of mapping words in SVOS with the value ‘No Match’ onto concepts in more general thesauri, the mapping of concepts in the domain specific thesauri (DST) onto the encoded concept relations in general thesauri (GT) gradually expands the target word lists. By incrementally processing one DST concept cluster at a time, it is easier to iteratively keep track of the code assignments and to have better control in the validation procedures. In these validation procedures, it will also be easier to include ‘knowledge’ about the words in the SVOS that have been assigned codes in earlier cycles. It will also be easier to take into account words co-occurring in larger units such as several sentences, for instance framed in text zones.
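The gradual expansion of the target word lists over the three cycles may, purely as a sketch, be expressed as follows. The general thesaurus is represented by two hypothetical dictionaries, the morphological variations and the manual validation steps are omitted, and the function names are illustrative assumptions.

```python
def build_target_word_lists(dst_entries, gt_synonyms, gt_hypernyms):
    """Build per-concept target word lists for TWS cycles 1, 2 and 3."""
    entries = set(dst_entries)
    twl1 = sorted(entries)                      # cycle 1: DST entry concepts
    twl2, twl3, overlaps = {}, {}, {}
    for concept in twl1:
        syns = set(gt_synonyms.get(concept, []))
        hypers = set(gt_hypernyms.get(concept, []))
        # Proposals that are themselves DST entry concepts are recorded
        # separately and excluded, since cycle 1 already covers them.
        overlaps[concept] = sorted((syns | hypers) & entries)
        twl2[concept] = sorted(syns - entries)    # cycle 2: synonyms
        twl3[concept] = sorted(hypers - entries)  # cycle 3: abstracted concepts
    return twl1, twl2, twl3, overlaps

def register(concept, twl2, twl3):
    """Return (target word, TWS cycle code) pairs for one DST concept."""
    return ([(concept, 1)]
            + [(w, 2) for w in twl2[concept]]
            + [(w, 3) for w in twl3[concept]])

# Hypothetical example data.
dst = ["law", "regulation"]
gt_synonyms = {"law": ["regulation", "statute"], "regulation": ["ordinance"]}
gt_hypernyms = {"law": ["rule"], "regulation": ["law", "rule"]}
twl1, twl2, twl3, overlaps = build_target_word_lists(dst, gt_synonyms, gt_hypernyms)
law_register = register("law", twl2, twl3)
```

Processing one concept at a time keeps each register small enough to be inspected before its codes are passed on to the next cycle.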

The V-component in both the SVOS and APOS is a sort of inner triplet association type, connecting the Subject (Agent) and the Object (Object). These inner associations may give guidelines for which relations to follow in the dictionaries (which are established encoded semantic spaces). However, a TWS directed towards relations will presumably need a more detailed validation and/or human intervention/correction. The identification of semantic relations between a verb phrase in a text and verbs encoded in a thesaurus is extremely complicated. This can be theoretically explained with reference to Thomas (1995): “i) There is no formal (grammatical) way of distinguishing performative verbs from other sorts of verbs. ii) The presence of a performative verb does not guarantee that the specified action is performed. iii) There are ways of ‘doing things with words’ which do not involve using performative verbs.” (1995:44). Based on theory and reported experiences with dictionary lookups, it is decided that TWS procedures for the verbs will not be performed. Rather, the verbs encoded in the V component in the SVOS will be replaced by the verb in its base form (representing the P component in the APOS). Thus verb occurrences such as {reduces, reduced, etc.} will be replaced by [reduce]. These base forms (lemma) must also include adverbial particles (important in the representation of verbal phrases in Scandinavian languages). It may however be convenient to group certain verbs in general semantic classes, for instance the class ‘express meaning’ with occurrences such as {say, declare, tell, utter, announce, affirm, assert, claim, etc.}.
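A minimal sketch of the verb handling described above is given below. In the embodiment the base form would come from the grammatical annotation; here a small hypothetical lookup table stands in for the tagger output, and the class table contains only the one class named in the text.

```python
# Hypothetical stand-in for the lemma supplied by the grammatical annotation.
VERB_LEMMAS = {
    "reduces": "reduce", "reduced": "reduce", "reducing": "reduce",
    "says": "say", "said": "say", "declared": "declare",
}

# General semantic verb classes; only the class named in the text is shown.
SEMANTIC_CLASSES = {
    "express meaning": {"say", "declare", "tell", "utter",
                        "announce", "affirm", "assert", "claim"},
}

def svo_to_apo(svo):
    """Replace the V component by its base form to obtain the P component."""
    subj, verb, obj = svo
    lemma = VERB_LEMMAS.get(verb.lower(), verb.lower())
    return (subj, lemma, obj)

def semantic_class(lemma):
    """Return the general semantic class of a verb lemma, if one is defined."""
    for name, members in SEMANTIC_CLASSES.items():
        if lemma in members:
            return name
    return None

apo = svo_to_apo(("tax", "reduces", "income"))
```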

In a present embodiment of the invention, it is possible to assign codes representing Tense (aspect of time dimension) and Modality, restricted to the set Past, Present and Future. This more detailed grammatical information about verb occurrences encoded in the P-component of APOS is represented in separate zone link sets (tense zones encoded as properties of each P-occurrence). The content of the link sets can be intersected with other link sets, and the intersected zones in the text pane will be highlighted accordingly. In the same manner, a user can navigate the text by following verb tense chains, moving from one sentence or zone to the next with either an instance of a particular verb tense or bundles of sentences with the same verb tense. This provides for a combined thematic and grammatical text exploration. However, it is important to be aware of the fact that tense is related to both the document's production date and the sentences' inner textual context. For example, a quotation can be in the present tense and the inner context will reveal the actual time of the utterance. Tense chains should therefore preferably be constrained with respect to a further classification of the nouns, for example nouns classified as referring to important actors, organisations, etc. (The document's production date is represented as Logical Now, and Past and Future forms in utterances can be represented as relative to Logical Now. However, this structure will not circumvent representational problems related to the textual context of the authors' utterances. Temporal relations reflect a deeper semantics and a formal representation of these relations must be based on more thorough interpretations.)
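The intersection of link sets and the navigation along a tense chain may be illustrated by the following sketch, in which link sets are assumed to be plain sets of sentence identifiers; the identifiers and set names are hypothetical.

```python
# Hypothetical link sets: sentence identifiers carrying a given property.
tense_past = {3, 4, 5, 9}
theme_company = {4, 5, 6, 9, 12}

# Sentences to highlight: those present in both link sets.
highlight = sorted(tense_past & theme_company)

def next_in_chain(chain, current):
    """Return the next sentence identifier in the chain after `current`."""
    later = [sid for sid in sorted(chain) if sid > current]
    return later[0] if later else None

following = next_in_chain(tense_past, 5)
```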

Proper nouns may be recognised and encoded during constraint grammar parsing provided that these nouns are encoded in the lexicon processed during parsing. The filter options encompassing search macros identifying proper nouns of interest to a user community must in addition include tailor-made word lists referring to organisations, persons, locations, etc. These collections of special terms will be organised in patterns based on the principles underlying the free faceted classification formulae. Some current constraint grammars are reported to have over 90% precision with respect to the recognition of named entities. Nominal expressions (word groups functioning as a noun) do however cause special problems. A TWS will not resolve semantic ambiguities caused by head-words, nouns modified by verbs, etc.

The Target Word Selection and Validation procedure, directed towards a domain specific document collection and in accordance with the specified needs in a user community, includes routines for the systematisation of the grammatical patterns underlying the selected set of sentences and the extracted set of SVOS. This collection of SVOS will serve one main purpose: they are input to search macros in the form of regular expressions. These regular expressions are more correctly described as ‘building blocks’ or the components of search macros. The building blocks do not cover the whole SVOS, but they represent regular expressions targeting the S component and the following V and O components. Search macros that are based on these building blocks are also combined into higher order search macros.
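The composition of building blocks into search macros may be sketched as below. The sketch assumes a simplified "word/TAG" token format and invented tag names (N-SUBJ, V, N-OBJ); these are hypothetical and do not reproduce the actual Constraint Grammar output.

```python
import re

# Building blocks over a hypothetical "word/TAG" token format.
SUBJECT = r"(?P<s>\w+)/N-SUBJ"
VERB = r"(?P<v>\w+)/V\b"
OBJECT = r"(?P<o>\w+)/N-OBJ"
# Skip any intervening tokens that are neither nouns nor verbs.
GAP = r"(?:\s+\w+/(?!N-|V)\S+)*\s+"

# Search macros composed from the building blocks.
SV_MACRO = re.compile(SUBJECT + GAP + VERB)
SVO_MACRO = re.compile(SUBJECT + GAP + VERB + GAP + OBJECT)

tagged = "the/DET company/N-SUBJ shall/MOD pay/V the/DET tax/N-OBJ"
match = SVO_MACRO.search(tagged)
svo = match.group("s", "v", "o") if match else None
```

Keeping the blocks as separate strings allows the same S, V and O expressions to be reused when higher order macros are assembled.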

It is known that a considerable amount of lexical units are recurring word combinations. Recurring bound word combinations are a typical linguistic feature of any domain, as with specialised terminology (specialised terminology often takes the form of specific word combinations). Signals for specific language that take the form of compound terms should therefore be indexed as complex or compound terms. The free faceted classification scheme gives guidelines for the index entry representation of complex terms. The application of these rules or guidelines leads to the construction of an index subsystem containing complex terms referring to typical phrases used in the domain. An index structure with phrases (phrase register) is an important information filtering tool. The APOS will have higher discrimination ability if they also include associations to at least very common phrases within a domain.

The multileveled annotation file system includes records (in supplementary files) of the associations between APOS and the SVOS from which the APOS are derived. The system of identifiers (assigned to Documental Logical Object Types) gives the connection to the underlying text units (annotated in the file system). Phrases also occur in synonym variants, that is, they vary in wording sequence and have transformational variations (e.g. minister of foreign affairs, foreign minister). In the interface structure, users are given the option to display all recorded phrases in addition to the default options.

Bound word combinations cause special problems and research reports seem to indicate that they cannot be treated fully compositionally. If they are considered as coherent building blocks in language use they must be represented as such in an index system. Bound word combinations must be addressed separately, and at present reports indicate progress regarding tagger software and the ability to recognise bound word combinations. General-purpose software such as WordSmith or Document Explorer provides computer-assistance (collocations with different cluster sizes together with frequency data). In the present invention the following approach is applied: a particular program combines frequency information (quantitative criteria) with grammatical information and delivers the result in the form of combined collocations. However, knowledge about the domain (qualitative criteria) provides guidelines for term inclusion. The more difficult part of the problem is related to the determination of where to locate (link or connect) complex terms with respect to the basic concepts being the default display option in the triple track (referring to APOS). The most promising solution is to display the ‘core term’ (preferably a noun) with options for displaying details about these particular nouns, that is, by activating a certain icon, the display of nouns is expanded to a display of structures resembling short KWIC (Key Word In Context).
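The expansion of a core term into a structure resembling a short KWIC display may be illustrated as follows. The function name, the bracket notation for the core term and the context width are assumptions made for the sketch.

```python
def short_kwic(tokens, core_term, width=2):
    """Return short Key Word In Context lines for every hit of `core_term`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == core_term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical example sentence containing two phrase variants.
tokens = "the minister of foreign affairs met the foreign minister today".split()
kwic = short_kwic(tokens, "minister")
```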

Lexical units signalling problems, solutions or evaluations (lead functions or discourse elements) will also occur as word combinations. Some of these lead functions can be identified by more or less unambiguous lexical signals (direct cue phrases). Obvious cue phrases may preferably be stored in separate word lists, and these word lists can be applied in a TWS procedure aiming at identifying and locating other words or phrases in the target words' neighbourhood that also indicate a discourse element. Chains defined over cue phrases, and as intersected with other types of lexical or grammatical chains, show interesting attention structures that may serve users with the need for profound text exploration. Utterances that implicitly indicate for instance a problem cannot be captured through automatic procedures, i.e., from the field of text linguistics it is known that ‘a negative evaluation of something’ may be the only indication of a problem, without using words such as {problem, crisis, disaster . . . }.

Assigning Domain Codes

The Target Word Lists are used in a process aiming at assigning Domain Codes to the SVO-triplets; after a validation procedure, these codes are attached to the SVO-triplet transformed into an APO-triplet. SVOS is a structure encompassing the set of SVO Triplets and APOS is a structure encompassing the reduced set of APO Triplets referencing the corresponding subset of SVO Triplets. The APO Triplets referring to sentences in the text are part of a larger representational unit denoted as Theme Representation. The Theme Representation keeps a record of links to all Documental Logical Object Types referred to by the theme representations. The Target Word Selection cycles with Domain Code assignment are outlined in FIG. 8 and in table 3.

TABLE 3 CG Tagger Output Parse text and identify syntactic elements andgrammatical functions within sentences. Type: Grammatical Information<Output from process> Process: Text Disambiguation <Is derived from>Type: CG Tagger Output <Is abstracted into> Type: Grammar Pattern <Isassigned to> Type: DLOT Word Type: Grammatical Function (GF) <is a>Type: Grammatical Word Class (GWC) <is a> Sentence Selection & AnalysisSelect and retrieve sentences that are linked to the SVO components.Type: DLOT Word <Is a> Type: Documental Logical Object Type (DLOT) <Ispart of> Type: DLOT Sentence <Is a> Type: DLOT Token Type: FrequencyInformation <refers to> Type: Grammatical Information <is assigned to>Type: SVO Triplet <refers to> The grammatical information derived fromsentences is abstracted into a set of Grammar Patterns (constellationsof regular expressions). Type: Grammar Pattern <Gives rules for> cat5fac0 Subject Matter <Gives rules for> Type: Search Macro Type:Grammatical Information <is abstracted into> Type: Regular Expression<is part of> Extract SVO Triplets (SVOS) Each component in a SVO Tripletis associated to the word units and each SVO Triplet is associated tothe underlying sentence from which it is derived. The nouns and verbs inSVO Triplets are in addition stored in separate word lists. These wordsare subset of the total set of words marked as certain grammatical wordclass in the CG tagger output file. 
The separate word lists are used infrequency calculations (noun and subject, noun and not subject, etc).Type: SVO Triplet <Is input to> Process: Target Word Selection (TWS) <Isextracted from> Type: DLOT Sentence <Refers to> Type: DLOT Word Type:APO Triplet <is derived from> Type: SVO Entry Noun <refers to> Type: SVOEntry Verb <refers to> Type: SVO Entry Noun <Is subset of> Type: GWCNoun <Refers to> Type: SVO Triplet Type: Domain Code <is proposed for>Type: Target Word <is applied on> Type: GWC Noun <Is input to> Type:Filter Noun <Is a> Type: Grammatical Word Class (GWC) <Is part of> Type:GWC Nominal Expression Type: GWC Noun Common <is a> Type: GWC NounProper <is a> Type: SVO Entry Noun <is subset of> Type: SWC Noun <refersto> For later search macro construction and files of phrase collectionsSystematise the linguistic patterns underlying the selected SVOS andspecify grammar based search macros covering these (combinations ofregular expressions as search operands). Type: Regular Expression <Ispart of> Type: Grammar Pattern <Aspect of> Type: DLOT Sentence TargetWord Selection cycle 1 Apply Target Word List (TWL) for cycle 1 . Theseare the target words to be searched for in the sentence from which theSVO triplets are derived. Current target words to be search for arefetched from this list. Type: Target Word DST <Is a> Type: Target Word<Is derived from> Type: Thesaurus Domain Specific (DST) <Is member of>Type: TWL 1 Validated Type: TWL 1 Validated <Is input to> Process: TWScycle 1 <Is a> Type: Target Word List (TWL) <Is derived from> Type:Thesaurus Domain Specific (DST) Type: Target Word DST <is member of>Concept Match: Identify and record matches from cycle 1 Assign andrecord a pair of codes for each concept match. 
Type level description ofconcept matches: Type: Value Concept Match <is assigned to> Type: TargetWord Type: Domain Code <is a property of> Type: Target Word Type: TWSCode Cycle <is assigned to> [Type: Domain Code <proposed for> Type: SVOEntry Noun] TWS1 is a code attached to all domain codes assigned in theTarget Word Selection Cycle 1. A word in a SVO Triplet may match withconcepts from several domain specific thesauri or concept clusters. Inthis case, also assign a code for the DST used. Register all matches atend of cycle 1 No Match Return the value ‘no match’ for each word in theSVOS not matching any of the concepts in the domain specific thesauri.Type: TWS Code Cycle <is assigned to> [Type: Value No Match <is assignedto> Type: SVO Entry Noun Register all non-matches at end of cycle 1Target Word Selection cycle 2 Apply Target Word List (TWL) for cycle 2.These lists of synonyms are categories of target words to be searchedfor in the sentence from which the SVO triplets are derived. (A categoryis a list of search operands separated by OR). Current target words tobe search for are fetched from this list. Type: Synonym GT <Output fromprocess> Process: TWS cycle 1 <Is a> Type: Target Word <Is derived from>Type: Thesaurus General (GT) <Transformed into> Type: TWL Synonym TargetWord DST Type: TWL 2 Validated <Is input to> Process: TWS cycle 2 <Is a>Type: Target Word List (TWL) Type: TWL Synonym Target Word DST <ismember of> Concept Match: Identify and record matches from cycle 2 Sameas for cycle 1. Current target words are fetched from the list ofsynonyms. TWS2 is a code attached to all domain codes assigned in theTarget Word Selection Cycle 2. Register all matches at end of cycle 2.Register all non-matches at end of cycle 2 Identify and recordco-occurrences within document or sections of document: Proposed SynonymCodes co-occurring with codes referring to DST Entry Concepts areidentified through assigned TWS Code Cycle (TWS1 and TWS2). 
Input to thesaurus construction/expansion. Target Word Selection cycle 3 Apply Target Word List (TWL) for cycle 3. These lists of abstracted concepts are target words to be searched for in the sentence from which the SVO triplets are derived. Current target words to be searched for are fetched from this list. Type: Concept Abstracted GT <Output from process> Process: TWS cycle 1 <Is a> Type: Target Word <Is derived from> Type: Thesaurus General (GT) <Transformed into> Type: TWL Abstracted Target Word DST Type: TWL 3 Validated <Is input to> Process: TWS cycle 3 <Is a> Type: Target Word List (TWL) Type: TWL Abstracted Target Word DST <is member of> Concept Match: Identify and record matches from cycle 3 Identify and record all concepts from the TWS Lists not matching for each TWS cycle. For each TWS cycle, make an organised list of all concepts encoded in the Target Word Lists that did not match with any of the words in the SVO Triplets. Report on all variations of the set of code assignments ({DST entry concept, synonymous concept, abstracted concept} AND TWS Code Cycle).

Apparatus for Zonation

Text zones constitute a fundamental attention structure and are considered as derived Documental Logical Object Types (DLOT). The apparatus delivers compound information resulting from the application of a set of zonation criteria or specifically a set of rules directing the operations performed on the underlying database partitions. The apparatus or module embodies a method and system for text zone identification and incorporates several interconnected devices reflecting the underlying zonation criteria, which influence the ‘importance’ assigned to sentences and words in the underlying annotated texts. FIG. 9 outlines the interconnected devices in the zonation apparatus.

In particular the devices generate:

A device produces combined collocations revealing the set of patterns combining frequency and distribution data with various types of grammatical information attached to each word occurrence. The present invention divides the set of patterns into patterns for the words' lexical features, the words' grammatical class, the words' grammatical form, and the words' syntactical function. The link sets generated for pairs of sentences conforming to these patterns are transmitted to a device that intersects the link sets with reference to the words' identifiers. The device relates to a method performing zone adjustment in which the zones' borders are strengthened. Zones embed other zones and zones overlap with reference to the multitude of zonation criteria. Information about the zones is preferably presented in the text sounding board with a set of options that reflect the multiple perspectives overlaid on the pairs of sentences encoded in the zone link sets. The user, being engaged in exploring and investigating text portions, can by applying these options shift her focus of attention and accordingly navigate to text portions reflecting the criteria she activated.

A device calculates cohesion relations between each sentence and all other sentences in the text. Zones determined solely on cohesion relations have weak ‘discontinuity borders’ and are overlapping. The present invention embodies several devices that strengthen zone borders and zone weights with reference to the multitude of zonation criteria and constraint rules.
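One simple way to picture the calculation of cohesion relations between each sentence and all other sentences is the pairwise comparison sketched below. Shared lemmas are used as the only cohesion signal, which is a deliberate simplification for illustration; the threshold parameter and the data are hypothetical.

```python
from itertools import combinations

def cohesion_links(sentence_lemmas, min_shared=1):
    """Return a link for every sentence pair sharing at least `min_shared` lemmas.

    `sentence_lemmas` maps a sentence identifier to the lemmas it contains.
    """
    links = {}
    for (i, a), (j, b) in combinations(sorted(sentence_lemmas.items()), 2):
        shared = set(a) & set(b)
        if len(shared) >= min_shared:
            links[(i, j)] = shared
    return links

# Hypothetical example data.
sentences = {
    1: ["company", "pay", "tax"],
    2: ["tax", "increase"],
    3: ["minister", "speak"],
}
links = cohesion_links(sentences)
```

Links computed this way alone tend to produce overlapping groups with weak borders, which is why further criteria are applied on top of them.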

A particular device extensively exploits grammatical syntax information related to nouns, and preferably nouns contained in lists of focused words and lists of words determined to be of importance to the user community (inventory lists, archive codes, keywords, etc.). In particular the device utilises TAM information (tense and modality) related to verbs in the first verb position following nouns in the syntactic subject position in sentences within text zones. Grammar based request patterns (tuned extraction procedures) identify the relevant syntax information applied in the text zone identification procedure. The device generates the underlying data set to be exposed in the triple track, which is embodied in the text sounding board, preferably in a separate ‘modus operandi’.

A device embodies a method for the identification and annotation of important words, preferably as related to requirements in a user community, and cue phrases, which are classified as lexical signals for elements in discourse models, in the following denoted as discourse elements.

A text zone may in exceptional cases consist of a single sentence if the sentence contains words classified as ‘important words’, or classified as central ‘cue phrases’.

The zonation procedure compares pairs of sentences, and as in any classification the procedure iteratively and for each round (or in parallel) addresses different features, aiming at identifying ‘resemblance in some features between sentences otherwise unlike’. The resemblance determined for text zones is however not only based on word stems (often seen as the unit in known approaches). The zonation procedure in the present invention rests on several types of information produced by pre-processing devices. The pre-processing stages are preferably implemented as separate devices in order to regulate costs (cost-performance-benefit issues in different user communities).

Whatever zonation criteria are applied, a text zone is defined as a bundle of sentences with two sets of properties:

-   1. The set of properties shared by a pair of sentences, and
-   2. The set of properties not shared by a pair of sentences.

In the present invention, the concept ‘properties’ does not only refer to lexical properties, but also includes properties related to grammatical form, semantic word classes, and cue phrases indicating discourse elements. The set of zonation criteria defines the rules and guidelines applied in order to determine which properties are shared by a pair of sentences, each property set being realised in separate zone link sets and chains. Chains are realised as inverted lists of pointers to the sentences classified according to the criteria. The surrogates or ‘link sets’ defined over the criteria will of course also reveal how pairs of sentences differ from each other, and taken together a particular device calculates connection points between each pair of sentences in the underlying text. The zones generated embody a structure in which zones identified and marked according to one set of criteria enclose other zones or overlap with other zones identified and marked according to another set of criteria.
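The realisation of chains as inverted lists, and the two property sets defined for a pair of sentences, may be sketched as follows. Properties are modelled as hypothetical (kind, value) tuples; the kinds shown (lemma, tense, cue) are examples only.

```python
from collections import defaultdict

def build_chains(sentence_properties):
    """Build inverted lists: each property points to the sentences carrying it."""
    chains = defaultdict(list)
    for sid in sorted(sentence_properties):
        for prop in sentence_properties[sid]:
            chains[prop].append(sid)
    return dict(chains)

def compare_pair(sentence_properties, a, b):
    """Return (shared properties, properties not shared) for a sentence pair."""
    pa, pb = sentence_properties[a], sentence_properties[b]
    return pa & pb, pa ^ pb

# Hypothetical example data.
props = {
    1: {("lemma", "tax"), ("tense", "past")},
    2: {("lemma", "tax"), ("tense", "present")},
    3: {("tense", "past"), ("cue", "problem")},
}
chains = build_chains(props)
shared, not_shared = compare_pair(props, 1, 2)
```

Sentences 1 and 3 in the example share no lemma yet sit on the same tense chain, which is the kind of non-lexical resemblance the zonation criteria are meant to capture.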

The intersection of zones based on various criteria reveals attention structures that are not immediately available on the text's surface level. The attention structures to a certain extent realize the intuitive impression one experiences when reading a text: some sentences ‘belong together’ and at some location the next sentence ‘for some reason’ is perceived as detached from the preceding sentences. The reference to ‘intuitive impression’ refers to modern accounts of the concept denoted as ‘text coherence’. Textual coherence is taken as an interpretative notion and occurs during the interaction between the text and the reader of the text; see also the section ‘The principle of text driven attention structures’. A particular reader may conceive two sentences as ‘belonging together’ even if they do not share any lexical features, i.e., as defined by lexical cohesion.

For example, suppose that two adjacent zones are separated due to discontinuity in lexical cohesion features. These two zones may nevertheless be similar in other respects. The other type of ‘similarity’ can be related to bundles of words in a particular grammatical form. Stretches of sentences including adjectives in comparative or superlative form may for example indicate some sort of evaluation or comparison, and taken together with the verbs' tense and modality (TAM) this may well be the ‘similarity feature’ apprehended by the reader. Consequently, different types of lexical cohesion do not, in isolation, constitute an adequate set of criteria for the construction of attention structures. Readers may perceive and interpret sentences as related based on features other than repetition or semantic substitution of lexical elements.

Zonation is quite different from segmentation, which denotes the procedures aiming at identifying the structure in a document, i.e. the identification of the documental logical object types and their interrelated arrangement within the document. The zonation of a text is a complex structure of interdependent text spans whose distribution, relations and properties are determined by the similarities between text constituents, namely sentences and words. Text is one out of many different documental logical object types and the present invention addresses in-depth processing of textual content.

Zones are text areas where lexical chains intersect and where there are bundles of members from each lexical chain. Data about intersections and bundles form a transparent layer of sentence links superimposed on the text. The data is transformed to a representational format and displayed in specially designed interfaces allowing the user to get a preferably clarifying impression of the texts' surface.

Certain textual areas are distinguished from surrounding areas and separate elements within the zones can be used as navigational aids to other zones. The textual elements (sentences and words) and their structural relationships are brought to the surface in a text sounding board together with options for text exploration and navigation. The text sounding board is a kind of ‘textographic’ map showing the way to text spans with certain features conforming to the zonation criteria.

Device Frequency/Grammatical Distribution Calculation

Methods and systems for the computation of collocations are known in the prior art. The device included in the present invention generates collocations combining frequency data and grammatical information attached to each word occurrence. The DBP Word Information is input to the device that produces these combined collocations, which are stored and managed in the Intermediate Layer of MAFS. The computed collocations are utilised as support in the apparatus that generates attention structures. The files uncover which grammatical request patterns are the most favourable in the various sets of text. There is a very wide spectrum of grammatical request patterns, and it is assumed that performance will benefit from knowing in advance which of these patterns to activate, and preferably also in what order, since results from one pattern are transmitted to another pattern via intermediate files.
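A combined collocation, in the sense of frequency data joined with the grammatical information attached to each word occurrence, may be sketched as a count over (lemma, grammatical class) pairs in a window around a node word. The tag names, window size and example tokens are hypothetical.

```python
from collections import Counter

def combined_collocations(tagged_tokens, node, window=2):
    """Count (lemma, grammatical class) pairs within `window` of the node word."""
    counts = Counter()
    for i, (lemma, _gc) in enumerate(tagged_tokens):
        if lemma != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tagged_tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tagged_tokens[j]] += 1
    return counts

# Hypothetical example data: (lemma, grammatical class) per word occurrence.
tagged_tokens = [
    ("the", "DET"), ("company", "N"), ("shall", "MODAL"), ("pay", "V"),
    ("tax", "N"), ("the", "DET"), ("company", "N"), ("shall", "MODAL"),
    ("report", "V"),
]
counts = combined_collocations(tagged_tokens, "company")
```

A high count for a modal verb next to a given noun would, for instance, suggest activating a modality request pattern for that noun first.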

Prior to the generation of attention structures, the collocation files thus provide useful information regarding the activation of grammar-based sets of request patterns. Each set of request patterns is defined with respect to a search intention.

In the example denoted as ‘Pattern: modality associated with the word ‘company’’, the collocation file indicates that the request pattern in the forms below should preferably be applied iteratively with the distance operator from 1 to 3 to the left or right (the distance operator is embodied as an ‘open operator’ that can be adjusted iteratively). The captured phrase occurrences with the modal verb ‘shall’ in the first or second position to the right can be given a higher weight with regard to the rhetorical interpretation of ‘obligation’ (depending on the sender of the document as encoded in the document's Dublin Core Element set).

-   [((modal verb AND verb present)<distance-right=‘open operand’>(noun=‘company’))] or the pattern
-   [((modal verb AND verb present)<distance-left=‘open operand’>(noun=‘company’))]
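By way of non-limiting illustration, the request pattern above can be sketched in Python. The token representation, the feature names (`modal`, `verb`, `present`, `noun`) and the function name `match_modal_near_noun` are assumptions made for this sketch and are not part of the described apparatus; the direction is interpreted here as relative to the noun.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical token representation: (lemma, set of grammatical features).
Token = Tuple[str, FrozenSet[str]]

# Assumed feature names for "modal verb AND verb present".
MODAL_PRESENT = frozenset({"modal", "verb", "present"})


def match_modal_near_noun(tokens: List[Token], noun: str = "company",
                          max_distance: int = 3,
                          direction: str = "right") -> List[Tuple[int, int, int]]:
    """Return (noun_index, modal_index, distance) for every present-tense
    modal verb found within max_distance positions of the noun."""
    step = 1 if direction == "right" else -1
    hits = []
    for i, (lemma, features) in enumerate(tokens):
        if lemma != noun or "noun" not in features:
            continue
        for distance in range(1, max_distance + 1):
            j = i + step * distance
            # Subset test: the candidate must carry all required features.
            if 0 <= j < len(tokens) and MODAL_PRESENT <= tokens[j][1]:
                hits.append((i, j, distance))
    return hits


sentence = [
    ("the", frozenset({"det"})),
    ("company", frozenset({"noun", "singular"})),
    ("shall", frozenset({"modal", "verb", "present"})),
    ("pay", frozenset({"verb", "infinitive"})),
]
hits_right = match_modal_near_noun(sentence, direction="right")
hits_left = match_modal_near_noun(sentence, direction="left")
```

The ‘open’ distance corresponds to the `max_distance` parameter, which may be raised step by step from 1 to 3; the recorded distance allows hits in the first or second position to be weighted more heavily in a later step.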

The process of calculating frequency and distribution according to grammatical information can be further specified into subsets according to which logical object types they are to cover.

DBP Information Sentence

Sentences are the main documental logical object type (DLOT) processed by the present invention. The Device Sentence Extraction fills a file in MAFS with information extracted from the set of annotated sentences. The files with sentence information are populated with output from several devices performing various processing tasks. These files are consolidated in the database partition DBP Information Sentence.

The set of attribute types attached to the documental logical object type sentence is denoted SATOT and includes the set: {Identifier Sentence, Sentence Class, Sentence Density, Sentence Descriptor, Sentence Length, Sentence Length GC, Sentence Length GF, Sentence Length GC Relative, Sentence Weight {Word Set ID}}.

In the database, different types of information about each sentence and the set of sentences (intratextual and intertextual) are consolidated over the Identifier Sentence (normal key propagation). The database partition containing information about sentences is interlinked with the DBP Information Word Occurrence via the identifier given for the set of words registered as constituents of a particular sentence. The DBP Information Word is filled with data generated by the majority of the devices, and the data are stored intermediately in MAFS before they are gradually consolidated in this particular DBP. The DBP contains all information applicable in the construction of attention structures and the construction of portions transmitted to the panes in the text sounding board. The information about each word in a sentence includes at minimum: {Identifier Word, Word Grammatical Class (GC), Word Grammatical Function (GF), Word Length, Word Lemma, Word Position Relative, Word Reading, {Word Semantic Code}, Word Stem, Word Weight}. This particular attribute set attached to the documental logical object type ‘word’ is denoted WATOT and is ‘wrapped’ by SATOT via common key propagation. The DBP Word Information Occurrence is further processed in a device that generates combined collocations (frequency and grammar based information). The information is consolidated to an upper level denoted as DBP Information Word. The DBP approach follows accepted DB design methodology known in the prior art.

Device Zone Identification

The present invention embodies an apparatus for identifying text zones that support the selection of and access to portions of text for explorative traversal and navigation. The zones are bundles of sentences that support the exploration of text when a user seeks to become aware of its ‘aboutness’. The main procedure for zone identification, which in its basic form is a kind of cluster analysis, can preferably be carried out adequately without access to general thesauri.

The text zonation procedure bears some resemblance to known procedures for text segmentation, and the literature on this matter very often cites a multitude of research reports addressing the issue of lexical cohesion underlying a segmentation algorithm. Several of the reported applications of text segmentation typically focus on the identification of segment boundaries.

The approach underlying the present invention differs from reported approaches in several ways. Firstly, the devices are not related to approaches aiming at identifying structural text segments correlating with the author's segmentation of text into sections, paragraphs, or other units that may be encoded with structural codes (e.g. in XML). Secondly, the devices are not concerned with reaching a result in which all sentences in the text are constituents of a text segment and where the segments are seen as contiguous. Thirdly, the aim is not to localise text segments in order to select a set of sentences or salient topical markers that are consolidated in a kind of text summary.

The purpose is first of all to identify certain areas in the text in which there is ‘more lexical cohesion than in other areas’ and where there are at the same time other specific types of markers and connections between sentences. The assumption is that by capturing text zones with features as specified in the set of criteria, these zones will serve users who for some task-related reason have to explore, read and interpret the texts. Particular devices direct the users' attention to these areas not only because the various zones indicate thematic shifts, but also because the zones reflect a kind of thematic density. Particular devices generate traversal bonds between zones with specific features, and these features are made concrete and offered to the user in a set of predefined search operands, preferably displayed in the text sounding board. When the user navigates along these bonds, she may become aware of central themes, also as intersected with sub-zones indicating for example discourse elements in a text. Areas with ‘more lexical cohesion’ than other areas and with a zone density above a certain intratextual threshold value will indicate that the words in lexical chains passing through the zones are not merely mentioned once but actually form a part of the theme dealt with in the zone. Thematic continuity and thematic discontinuity, and other types of continuity and discontinuity, are the main issue in the present invention's device for zonation. The particular issues are elaborated in the section ‘Zonation Criteria’.

The Zone Identification Device, in its plain form, requires that the text is pre-processed by at least a POS-tagger and that the words in the annotated files are also normalised into lemma form. The preferred embodiment of the present invention rests on texts pre-processed by a Constraint Grammar tagger (CG-tagger) known in the prior art. The database partition ‘DBP Information Word’ contains a wide range of grammatical information attached to each word occurrence, and in addition several attribute types containing derived values, i.e., output from other devices. The device operates on the constituents of the documental logical object type Sentence, and the device requires no other types of structural information about the text.

Information about plain lexical cohesion points for each sentence pair in the text is registered in a diagonalized matrix stored in MAFS, which other devices access and manipulate along several dimensions. Since each sentence is compared with all the other sentences in the text, the matrix will also represent ‘long-distance’ similarities between sentences. During the plain zone identification (a cluster analysis performed on grammatically derived information), several types of information about the sentence similarities are registered in particular zone link sets, and these files are further processed in weighing procedures in accordance with the various types of zonation criteria. The zonation procedure is preferably tuned according to text genre.

A previous examination of governmental reports revealed that the number of chapters and the number of sentences within each chapter varied greatly. The number of chapters varied from 4 to 27, and in some extreme cases the number of sentences in individual chapters varied from 3 to about 1250. This calls for a device that calculates intratextual threshold values applied in the zonation procedure, since it is not possible to apply the same threshold values to extraordinarily long texts and to extraordinarily short texts. It is important to note that a chapter that contains only 3 to, say, 15 sentences usually, in the case of governmental reports, contains important sentences, i.e., the sentences mediating the solution proposals or the solutions selected (decisions). The device for text extraction is specialised in that it delivers information about the ‘short chapters’, because these chapters are a good starting point for determining central themes in the report. The reason is that the words and phrases in sentences mediating the governmental focus may strongly indicate superordinate subject matter. However, seen from the perspective of users confronted with these texts with reference to some instruction, it may well be that other parts of the text are of more value, for example the counter arguments mediated on behalf of other actors. Conflict and disagreement as reflected in language use are often under-communicated, and for user communities of, for example, lawyers, indicators of opposition are of high value. Typical foreseen user needs ground the emphasis given to an apparatus for zonation which embodies a method and system capturing thematic fluctuations, grammatical patterns and indicators of discourse elements, intersected with word lists containing information about words and phrases judged to be of importance in the user community.

An important device in the apparatus for text zonation is the device for calculation of scores or values assigned to each of the connection points between sentences. A connection point has a compound identifier: the identifiers of the two sentences compared. A set of score constraint rules regulates the calculation device according to parameters such as language, text genre, text length, text annotations, and statistical text features (average sentence length, number of sentences, etc.). The constraints applied cause a tuning of the procedure, and preferably these tuned versions are documented in the IRMS. The documentation of any tuning of any of the devices is essential for reuse and for the ability to pick ‘the right tool set’ in each situation involving a new document collection (text genres) and user community. The calculation of the score for connection points between pairs of sentences can for instance be strengthened for words that belong to certain grammatical classes, or for words or concepts that have previously been registered in a User Profile. The calculation procedure preferably extracts a set of rules from the rule set specified as Zonation Criteria. The rule set denoted as [Criteria Pragmatic User Profile Spin-Off] gives a prescription of how to extract words or concepts from a User Request previously ascertained as convenient by a user or user group.

DLOT Zone

A zone is a derived documental logical object type. A zone is defined as a text span consisting of at least two neighbouring sentences with a connection score above a certain intratextual threshold value. A zone may also embed subzones according to specialised criteria such as thematic variations, grammatical information, discourse elements, important words, etc.
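By way of non-limiting illustration, the basic zone definition can be sketched as a scan over the connection scores of neighbouring sentences. The function name `find_zones`, the 1-based sentence numbering and the strict comparison against the threshold are assumptions of this sketch; subzones, ‘in-between’ sentences and the other zonation criteria are not modelled.

```python
from typing import List, Tuple


def find_zones(adjacent_scores: List[float],
               threshold: float) -> List[Tuple[int, int]]:
    """adjacent_scores[i] is the connection score between sentence i+1 and
    sentence i+2 (1-based sentence numbers). A zone is returned as the pair
    (first sentence, last sentence) of a run of neighbouring sentences whose
    consecutive connection scores all exceed the threshold."""
    zones = []
    start = None
    for i, score in enumerate(adjacent_scores):
        left_sentence = i + 1
        if score > threshold:
            if start is None:
                start = left_sentence          # a new zone opens here
        elif start is not None:
            zones.append((start, left_sentence))  # the run ended at this sentence
            start = None
    if start is not None:
        # The last run extends to the final sentence of the text.
        zones.append((start, len(adjacent_scores) + 1))
    return zones


# Seven sentences, six scores between neighbours, threshold value 1.
example_zones = find_zones([3, 4, 0, 0, 2, 5], threshold=1)
```

Each returned pair consists of the first and last sentence of a zone and always spans at least two sentences, since a zone is only opened when one connection score between neighbours exceeds the threshold.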

Zones are considered as horizontal virtual layers superimposed on the underlying texts and visualised in accordance with certain rules directing a preferred ergonomic display, also taking into account modern principles of HCI.

A zone, as seen in the present invention, may include ‘in-between’ sentences without the lexical cohesion property. These sentences may however relate to the surrounding sentences according to other criteria. That is, the present invention departs from the issue of contiguity. A zone is a text partition or area created for a particular purpose or, in other words, a zoned text span (sentences) that is distinct from surrounding or adjoining parts (sentences). The zones can be said to have an ‘aboutness’, and this ‘aboutness’ may appear in other zones (with short or long distance between the zones). The other zones, not necessarily adjacent zones (or sentences in between zones), may reflect the same or related thematic issues and at the same time reflect thematic discontinuity. In addition there may be ‘extraordinary zones’ characterised by having no connection points between adjacent or near-adjacent sentences, and possibly without ‘long-distance’ similarities. Such ‘extraordinary zones’ can for example signal quotes in different languages, or quotes from particular genres such as laws, etc. In the present invention zones are a kind of ‘text block’, and when realised as a specially designed focusing device, the zones provide a ‘window onto text contents’ and give the user a possibility of becoming aware of the ‘aboutness’ of the text.

A zone is defined as a text span consisting of at least two neighbouring sentences with a certain degree of connection between them (including lexical cohesion features). The present invention's concept ‘zone’ is not related to the more common concept of ‘text segment’; text segments are normally considered to form contiguous parts of a text. A multitude of research reports describe cohesive lexical links as manifested through lexical repetition, lexical substitution, co-reference, paraphrasing, etc. The present invention does not impose the restriction that the zones have to be connected along a boundary or at a point, or that the bonds generated between them follow a certain sequence. Zones are identified across sentences, and some zones may match author-determined logical object types such as paragraphs and may or may not include section headings. The match between zones and author-determined segments in a text is not an important issue in the present invention.

Zone Link Set

The files denoted ‘Zone Link Set’ contain the link sets for all sentences processed in the device for Zone Identification. When the scores are calculated, weighed and tuned, the files are consolidated in DBP Information Zone.

A particular interconnected device computes connection points between sentence ‘S’ and ‘S+1’ as CP (S, S+1), where S ranges from 1 to the number of sentences extracted from the text (eof) minus 1. A connection point is the score for the number of words in each sentence that are related to words in other sentences, calculated according to the score specifications given in the set of zonation criteria.

Each point CP (S, S+1) is a candidate for a zone border, and a particular device examines the scores (weighing scores) before they are transmitted for further processing.

S is a variable-length vector listing the word's position within a sentence, the word, the word's lemma, grammatical information, and other types of information depending on the zonation depth (for instance the word's semantic category or codes referring to pragmatic criteria). The scores for the connection points are recorded in a diagonalized matrix for each text.

For example:

-   ((Sentence-ID=23) {(pos-1, government, government, noun singular, det, subject), (pos-2, disapprove, disapprove, verb present, _), (pos-3, Statoil, Statoil, noun proper, object)})
-   ((Sentence-ID=24) {(pos-1, Statoil, Statoil, noun proper, subject), (pos-2, disapprove, disapprove, verb present, _), (pos-3, government, government, noun singular, det, object)})
-   SCORE (23, 57)=5 (two sentences that are not adjacent, but with registered similarities weighed according to the zonation criteria).

The score is composed as follows: +1 for government, +1 for Statoil, +1 for disapprove, and +1 for verb in the present, +0.5 because government is in the subject position in sentence 23 and in the object position in sentence 24, and +0.5 because Statoil is in the object position in sentence 23 and in the subject position in sentence 24. The value 0.5 is a simple weight measure. Each class of criteria includes simple rules for adding weight to the scores (scores for each connection point). In this example the weighing rule is shown as simply adding values to the connection point's score depending on criteria. For details, see section ‘Zonation Criteria’.
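By way of non-limiting illustration, the additive scoring of the worked example above can be reproduced with a short Python sketch. The dictionary keys (`lemma`, `gc`, `gf`) and the function name `connection_score` are assumptions of this sketch; the rule values (1 per shared lemma, 1 for a shared verb form, 0.5 per changed syntactic function) are taken from the example, while the actual rule set is governed by the zonation criteria.

```python
def connection_score(sentence_a, sentence_b, role_shift_weight=0.5):
    """Add up the illustrative rules of the worked example for one pair of
    sentences. Each word is a dict with the assumed keys 'lemma', 'gc'
    (grammatical class) and 'gf' (grammatical function, or None)."""
    words_a = {w["lemma"]: w for w in sentence_a}
    words_b = {w["lemma"]: w for w in sentence_b}
    score = 0.0
    for lemma in words_a.keys() & words_b.keys():
        wa, wb = words_a[lemma], words_b[lemma]
        score += 1.0                      # shared lemma
        if wa["gc"].startswith("verb") and wa["gc"] == wb["gc"]:
            score += 1.0                  # same verb form, e.g. verb present
        if wa["gf"] and wb["gf"] and wa["gf"] != wb["gf"]:
            score += role_shift_weight    # e.g. subject in one, object in the other
    return score


s23 = [
    {"lemma": "government", "gc": "noun singular", "gf": "subject"},
    {"lemma": "disapprove", "gc": "verb present", "gf": None},
    {"lemma": "Statoil", "gc": "noun proper", "gf": "object"},
]
s24 = [
    {"lemma": "Statoil", "gc": "noun proper", "gf": "subject"},
    {"lemma": "disapprove", "gc": "verb present", "gf": None},
    {"lemma": "government", "gc": "noun singular", "gf": "object"},
]
example_score = connection_score(s23, s24)
```

With the two sentences listed in the example, the sketch yields 1.5 for government, 1.5 for Statoil and 2 for disapprove, i.e. a score of 5.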

Identifier Zone

A zone identifier is compound and consists of the edge sentences' identifiers, i.e. the identifiers of the first and last sentences in a zone.

Zone Border

A zone border indicates a transition point or a ‘discontinuity’ between one zone and adjacent zones, or between one zone and enclosed zones, i.e. sub-zones, or overlapping zones.

DBP Information Zone

Zones provide ‘virtual horizontal windows’ superimposed on the underlying text and reflect aspects of the texts' features under the notion that text is an interpretative medium.

The DBP containing Zone Information encompasses all relevant information about the zonation criteria applied during processing of particular texts. For each zone, the system registers the sentence identifier and the number of sentences in the zone (frequency measure used in one of the filtering options). The DBP Information Sentence is interlinked to the DBP Information Word Occurrence. The DBP Information Zone includes all information applied during the generation of zone traversal paths, i.e. bonds that interconnect zones according to preferably user specified criteria. The DBP provides the basis for identifying what's new in a zone as compared to a preceding interlinked zone (Zone Traversal Path Default) or between two zones ‘preceding each other’ according to user requests (Zone Traversal Path Adjusted). The files with zone link sets are consolidated into this DBP, which is vital for the filtering options that regulate the content displayed in the text sounding board.

Device Zone Density Calculation

This particular device calculates the density of each chain crossing each Zone.

The density of a chain in a zone is defined as the number of members (word occurrences) from the chain that appear in the zone, divided by the number of words in the zone that are classified as belonging to one of the four grammatical classes. The density measure can preferably be further constrained so that it reflects only those words that are in the same grammatical class as the words being members of the lexical chain.
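By way of non-limiting illustration, the density definition can be sketched as follows. The names `zone_density` and `CONTENT_CLASSES`, the representation of zone words as (lemma, grammatical class) pairs, and the identification of the four grammatical classes with noun, adjective, verb and adverb are assumptions of this sketch.

```python
# Assumed reading of "the four grammatical classes" (content words).
CONTENT_CLASSES = {"noun", "adjective", "verb", "adverb"}


def zone_density(zone_words, chain, same_class_only=False):
    """zone_words: list of (lemma, grammatical class) pairs for all words in
    the zone. chain: (lemma, grammatical class) of the word on which the
    chain is based. Returns chain members divided by the content words of
    the zone, or by the words of the chain's own class when constrained."""
    chain_lemma, chain_class = chain
    members = sum(1 for lemma, gc in zone_words
                  if lemma == chain_lemma and gc == chain_class)
    if same_class_only:
        pool = sum(1 for _, gc in zone_words if gc == chain_class)
    else:
        pool = sum(1 for _, gc in zone_words if gc in CONTENT_CLASSES)
    return members / pool if pool else 0.0


zone = [
    ("government", "noun"), ("disapprove", "verb"), ("statoil", "noun"),
    ("the", "det"), ("government", "noun"), ("strongly", "adverb"),
]
density_all = zone_density(zone, ("government", "noun"))
density_nouns = zone_density(zone, ("government", "noun"), same_class_only=True)
```

In the small zone above, two of the five content words belong to the chain, which gives a density of 0.4; restricted to nouns, the density is two out of three.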

The density is calculated for each chain in each Zone. If the text is short or if the device for zone identification produces few or no zones, this device can be adjusted to sentence level (sentences are constituents in a zone). A zone sentence has (of course) the same identifier as initially assigned to the ‘DLOT Sentence’, and the number of sentences between the first and last zone sentence is a derived value. The Sentence Length GC contains data about the number of words within each grammatical class appearing in each sentence.

A zone sentence is input with connections described as: Zone Sentence: <is a> Sentence Class, <is derived from> Zone Border, Zone Sentence First <is a>, Zone Sentence Last <is a>, in which Zone Border: <is derived from> DBP Information Zone, <is input to> Device Target Word Selection (TWS), Zone Proximity <is derived from>, Zone Sentence <is derived from>, and where Frequency Sentence Level: <is consolidated in> DBP Information Sentence Consolidated, <is subset of> Frequency Text Level, Device Frequency/Grammatical Distribution Calculation <produces>, Frequency Zone Level <is derived from>, Sentence Length Average <is part of>, Sentence Length Standard Deviation <is part of>.

Zone Density

Each Zone Identifier is linked to a vector (inverted list) containing data about the density of each Lexical Chain intersecting the Zone.

Zone Identifier {Lexical Chain Identifier, Zone Density}.

The Zone Density can be further weighed by including a measure for Text Density. Text Density is defined as the Lexical Chain Length divided by the sum of Sentence Length GC.

The vector (one for each Zone Identifier) including the Lexical Chain Identifier makes it possible to identify crossing chains within a zone as well as the chains' zone density. In addition it will be possible to identify the appearance of ‘new chains’ in an adjacent zone, thus indicating a thematic shift. When the user navigates from one zone to the next zone, the ‘new information’, as compared to the chains intersecting the preceding visited zone, may be marked (for instance by simply increasing the intensity factor (I) in the preferred colour display scheme).
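By way of non-limiting illustration, the vector attached to each Zone Identifier and the detection of ‘new chains’ can be sketched with plain dictionaries. The structure `zone_density_index`, the chain identifier strings and the function name `new_chains` are hypothetical and serve only to show the comparison between a zone and the preceding visited zone.

```python
# Hypothetical index: Zone Identifier (first, last sentence) mapped to
# {Lexical Chain Identifier: Zone Density}.
zone_density_index = {
    (1, 3): {"w-1-2": 0.40, "w-2-5": 0.20},
    (5, 7): {"w-1-2": 0.25, "w-5-1": 0.50},
}


def new_chains(previous_zone, next_zone, index):
    """Return the identifiers of chains that intersect next_zone but did not
    intersect previous_zone, i.e. candidates for 'new information'."""
    return sorted(set(index[next_zone]) - set(index[previous_zone]))


def shared_chains(previous_zone, next_zone, index):
    """Return the identifiers of chains crossing both zones."""
    return sorted(set(index[next_zone]) & set(index[previous_zone]))


fresh = new_chains((1, 3), (5, 7), zone_density_index)
kept = shared_chains((1, 3), (5, 7), zone_density_index)
```

The identifiers returned by `new_chains` are the ones a display component could mark when the user moves from the first zone to the second.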

Zone Proximity

Zones as marked by zone borders, which are defined by the set of edge sentence identifiers, are in the proximity of each other in different ways as listed: Zone A <encloses> Zone B, Zone A <within> Zone B, Zone A <overlapped by> Zone B, Zone A <overlaps> Zone B, Zone A <follows> Zone B, Zone A <precedes> Zone B.
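By way of non-limiting illustration, the listed proximity relations can be derived from the edge sentence identifiers alone. The function name `zone_proximity`, the representation of a zone as a (first, last) pair of sentence numbers, and the extra value ‘identical’ for two zones with the same borders are assumptions of this sketch.

```python
def zone_proximity(zone_a, zone_b):
    """Classify how zone A relates to zone B. Each zone is a pair
    (first sentence number, last sentence number)."""
    a_first, a_last = zone_a
    b_first, b_last = zone_b
    if zone_a == zone_b:
        return "identical"          # not in the listed relations; assumed here
    if a_last < b_first:
        return "precedes"
    if a_first > b_last:
        return "follows"
    if a_first <= b_first and b_last <= a_last:
        return "encloses"
    if b_first <= a_first and a_last <= b_last:
        return "within"
    # The zones share sentences, but neither contains the other.
    return "overlaps" if a_first < b_first else "overlapped by"


relations = [
    zone_proximity((1, 10), (3, 5)),
    zone_proximity((3, 5), (1, 10)),
    zone_proximity((1, 5), (4, 8)),
    zone_proximity((4, 8), (1, 5)),
    zone_proximity((1, 3), (5, 7)),
    zone_proximity((5, 7), (1, 3)),
]
```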

Device Score Connection Point Calculation

The frequency and distribution data are calculated for content words (covering the grammatical word classes noun, adjective, verb and adverb). The set of words classified as ‘Important Word’ is preferably applied for the purpose of adjusting the zone borders that are automatically determined in the first round of the zone identification procedure. A specific device calculates the connection score for each pair of sentences in the text. A set of constraint rules regulates the score calculation process. The specific set of rules applied depends on criteria such as: text length (number of sentences, average sentence length), text genre (law, report, etc.), grammatical information (word lemma, word stems or word reading), language information, and stop lists applied (the latter depends on language). Tuning of the score calculation by applying various sets of constraint rules is essential, in that for example one set of constraints will perform satisfactorily for long texts such as reports, but will yield a low performance on short texts or texts belonging to a different genre. The constraint rules specify weighing functions with reference to grammatical information, e.g., matches between nouns in pairs of sentences are given a higher score than matches between high-frequent adjectives, and so on.

Each sentence is assigned a Sentence Identifier (part of the Attribute Type set attached to sentences defined as a documental logical object type (DLOT)). The sentence identifier is compound, with its first element inherited from the text from which the sentence is extracted. The sentences in a text (with a Text Identifier) are assigned serial numbers, starting with the first sentence, so that the identifier reflects the sentences' relative positions within the text. The score for connection points between sentences is calculated for each pair of sentences in the text. That is, the first sentence is compared with all other sentences, and so on for each subsequent sentence. The device generates individual matrices and a matrix containing the consolidated scores for each connection point. The criteria applied in the zonation procedure determine the calculation of the connection points: some scores reflect lexical cohesion features, other scores reflect grammatical information such as word class and word form, and others reflect methods applied for marking discontinuities.

The score identifier consists of the sentence identifiers for the pair of sentences processed; thus the score identifier for the pair of sentences 1 and 2 is (S1, S2). SCORE (Current S, S+1 (until EOF)). Increment Current S (get next sentence while not EOF).
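By way of non-limiting illustration, the pairwise loop can be sketched as follows. The function names, the use of a dictionary keyed by the pair of sentence numbers, and the reading of the ‘diagonalized matrix’ as a matrix in which each unordered pair is stored once are assumptions of this sketch; the placeholder score simply counts shared lemmas and stands in for the rule-based calculation.

```python
from itertools import combinations


def lemma_overlap(sentence_a, sentence_b):
    """Placeholder score: number of distinct lemmas shared by two sentences."""
    return float(len(set(sentence_a) & set(sentence_b)))


def score_matrix(sentences, score=lemma_overlap):
    """Compare every sentence with every later sentence and store the score
    under the compound identifier (S_i, S_j) with i < j (1-based)."""
    pairs = combinations(range(len(sentences)), 2)
    return {(i + 1, j + 1): score(sentences[i], sentences[j]) for i, j in pairs}


def adjacent_connection_points(matrix, sentence_count):
    """Extract CP(S, S+1) for S = 1 .. sentence_count - 1."""
    return [matrix[(s, s + 1)] for s in range(1, sentence_count)]


text = [
    ["government", "disapprove", "statoil"],
    ["statoil", "disapprove", "government"],
    ["report", "chapter"],
    ["government", "report"],
]
matrix = score_matrix(text)
adjacent = adjacent_connection_points(matrix, len(text))
```

Because all pairs are stored, the same structure also holds the ‘long-distance’ similarities, for instance the entry for sentences 1 and 4 in the small example.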

Device Zone Weight Calculation

The device for the calculation of zone weights operates on the DBP Information Word (related to the DLOT Word). The importance of a word (or phrase) contained in a text is determined first (importance is calculated based on frequency information and weighing functions, both calculated intratextually and then consolidated intertextually). The weighing function takes into account the notion of words unifying the text (general and high-to-medium frequent words) and diversifying words (specific medium-to-low frequent words). The weighing function also takes into account the distance between general and specific words in cases where the specific word is a concatenated word and its constituents are related to neighbouring general words. The weighing function operates iteratively on DBP Information Word, and words with certain syntactic functions, or other types of grammatical information, can be assigned a higher weight as a preferred reduction strategy influencing the content in the text sounding board.

If the text is short or if the device for zone identification produces few or no zones, the same weighting function may be applied at sentence level. The rules for calculating weight follow the guidelines as specified for the zonation criteria.

Zone Weight

Some chains (lexical or grammatical) are assumed to be ‘stronger’ than others. The present invention includes a device for calculating the zone weights based on the notion that Zone Density combined with Zone Distance preferably will support options for the generation of Zone Traversal Paths. The calculation operates on data derived from the Zone Density file. The Zone Identifier is composed of the identifiers of the first and last sentence in each zone. The distance between Zone-1 and Zone-2 is therefore the distance between Zone-1-Last and Zone-2-First (the sentence identifier is the sentence number within each text), then between Zone-2-Last and Zone-3-First, and so on.
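By way of non-limiting illustration, the Zone Distance between successive zones can be computed directly from the compound Zone Identifiers. The function name `zone_distances` and the ordering of zones by their first sentence are assumptions of this sketch; how the distance is combined with Zone Density into a weight is not modelled.

```python
def zone_distances(zones):
    """zones: list of Zone Identifiers (first sentence, last sentence).
    Returns (zone, next zone, distance) triples, where the distance is the
    first sentence number of the next zone minus the last sentence number
    of the current zone."""
    ordered = sorted(zones)
    return [(current, following, following[0] - current[1])
            for current, following in zip(ordered, ordered[1:])]


distances = zone_distances([(5, 7), (1, 3), (12, 14)])
```

A distance of zero or less in this sketch would indicate zones that share sentences, which corresponds to the overlapping and enclosing cases listed under Zone Proximity.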

Zones with high density with respect to an intersecting chain and low distance to the next zone (with a density value for the same lexical chain) indicate the weight of the word on which the lexical chain is based. The weight reflects chains (with many members) intersecting zones that appear close to each other (see Zone Proximity).

The present invention applies the weighing function on a derived documental logical object type, i.e., sentences interlinked through the identification of connection features, including lexical cohesion, grammatical information, and preferably semantic information and information reflecting pragmatic criteria, thus forming discontinuous and overlapping groups of sentences denoted as a text zone.

An important point to be made about zone weights is that they support text exploration as an inside-out navigation. The ‘inside’ is the central parts of texts as marked out by the zones and their embedded sub-zones. The ‘out’ is the zones with lower weight or sentences that are not included in any zones. However, an exception is made for important sentences in that they can constitute a zone based on the sentence's discourse features, some of which are related to the communicative acts that took place in the document's situational context.

Device Chain Generation

The device for generation of chains is interconnected with the device for identification of text zones; the latter transmits information about the zone link sets to the former.

The content of the zone link sets naturally depends on the criteria applied during the zone identification procedure, in that the content may vary in levels of exhaustivity and specificity.

The application of the different classes of criteria depends on the availability of lexical resources known in the prior art, such as grammar taggers, domain specific word lists, domain specific thesauri, etc. Additionally, and most importantly, the application of the advanced options specified for some of the criteria depends on a decision on whether the procedure is to be unsupervised (fully automatic) or semi-automatic with manual intervention (validation of semantic relations between words).

The present invention separates the set of underlying criteria due to cost issues. The more advanced the criteria applied, the more resources are required, including the need for manual intervention and validation.

The basic method for zone identification simply recognises lexical cohesion between words in pairs of adjacent sentences, and the words have to at least be annotated with tags for the four main grammatical classes, and preferably normalised into lemma form. This basic zonation procedure is not sufficient for the present invention's purpose of constructing attention structures virtually superimposed on the underlying text.

The reason for this is that the basic method produces text zones that are too large and have a low discriminating value. Some very high-frequent words, especially nouns and adjectives, function as words unifying the text and influence the zonation procedure negatively. Words with these features are known to cause a failure of discrimination, and this also applies to zones generated with reference to general words. Since several zones will have some or many of the same words assigned to their link sets, these words will not discriminate ‘useful’ from ‘useless’ zones with respect to the notion of attention structures. Attention structures are to reflect how the author's focus of attention moves across the text, and these general words do not signify thematic variations or argumentative variations (discourse elements). The thematic and argumentative variations surround these unifying words, and the present invention relates to a method for capturing these thematic and argumentative variations considered as overlapping sub-zones within zones (i.e. zones enclose overlapping sub-zones, see Zone Border).

The overlaps are not only detected with reference to word occurrences commonly organised in a continuum from very general to highly specific words (cf. Zipf's law of distribution). Particular devices also detect overlaps with respect to words classified as, for example, important words with reference to user communities, and with respect to zonation criteria based on grammatical information such as verb tense and modality, etc.

The generation of chains can be totally based on the output given in the zone link sets generated by a basic zonation procedure, that is, one chain for each different word (word type) that is registered in the zonation procedure. This will eliminate all words occurring only once in each text. In principle these word types (low frequency) will not be displayed in the text sounding board. However, exceptions are mentioned below.

If the zonation procedure operates on the word types and not their lemma forms, word types with a frequency of 1 will typically cover 40-50% of the word types, and about 75% of the word types may have a frequency of 1 to 5 (Norwegian texts, governmental reports). However, attention must also be paid to these word types, because they typically include words that in the user's perspective may be the most specifically searched for and accordingly have a high discriminating ability, i.e. words that within the inner context diversify the language use of the author. A low-frequency word type tagged as ‘unknown’ is either misspelled (and therefore not found in the lexicon), or it may be a ‘new’ word not yet registered in the lexicon. In the latter case the word type may be highly significant for a particular user. A programmed connection to lists of commonly misspelled words will support the identification of misspelled words. An iterative consistency check against a specially designed, domain and genre partitioned corpus will yield the frequency profile in large collections, and gradually it may be possible to encircle these possibly significant words that preferably must be captured. For Norwegian texts (and other languages), a great many of the low-frequent words will be captured by a device that links the constituents of concatenated words to a ‘core word’ that is similar to any of the constituents.
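By way of non-limiting illustration, the frequency profile referred to above (the share of word types occurring once, and the share occurring one to five times) can be computed for a given text as follows. The function name `frequency_profile` and the keys of the returned dictionary are assumptions of this sketch; the toy input is not drawn from any examined report.

```python
from collections import Counter


def frequency_profile(word_occurrences):
    """word_occurrences: list of word types (readings) in text order.
    Returns the number of distinct word types and the shares of types with
    frequency 1 and with frequency 1 to 5."""
    counts = Counter(word_occurrences)
    type_count = len(counts)
    if type_count == 0:
        return {"types": 0, "share_freq_1": 0.0, "share_freq_1_to_5": 0.0}
    once = sum(1 for c in counts.values() if c == 1)
    low = sum(1 for c in counts.values() if c <= 5)
    return {
        "types": type_count,
        "share_freq_1": once / type_count,
        "share_freq_1_to_5": low / type_count,
    }


profile = frequency_profile("a a a a a a b c c d".split())
```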

When a user activates one of the word types displayed in the text sounding board, a device for visualisation in the text pane will pick up the current chain and highlight words registered in the chain. Most likely, some of the word types covered by the chains will be considered more useful than others within a user community, and consequently a device controlling a wide variety of filter options operates on the ‘DBP Information Chain’. The filtering options or reduction strategies are made available to the user as buttons in the text sounding board.

Chain

A chain is defined as the interlinked set of word occurrences in a text sharing some specified features (word type (reading), lemma, syntax, form, relative position, etc.). At the lexical level, lemma is the preferred default representation form.

The Zone Link Set, from which the Chains are derived, contains information related to all grammatical classes that are involved in the zone identification procedure. The lexical chains are primarily based on the grammatical class nouns, but can be further restricted to a specific word (noun) appearing in the syntactical role as Subject.

Chains can also be formed on the basis of word collocations, for instance a specific adjective followed by a specific noun (within a specified distance), or a frequently occurring noun constellation, etc. See under Device Frequency/Grammatical Distribution Calculation.

The device for zone identification identifies and marks text zones based on author focused information, i.e., information which an author has communicated through the text and which is of potential interest to the user wishing to explore the text content by traversing the zones. The user can constrain the zone traversal by giving navigational instructions through several options offered in special-purpose panes preferably displayed in the text sounding board.

Information selected by a user is denoted as user requested information and the present invention registers the information in a User Profile. The user can reactivate a stored Profile when entering a new text in the collection processed and prepared for exploratory discovery. In this manner, the present invention supports a zone traversal according to the user's stored preferences. The user can edit or discard the stored Profile.

Identifier Chain

Each word participating in a chain is assigned a Chain Identifier, which is the Word Identifier of the first occurrence of a specific word in a text. This chain identifier is the access point to a file (inverted list) with entries for all members of a chain.

By using the Word Identifier of the first word occurrence in a text, information about sentences (nearest inner context), text (inner context) and document is ‘inherited’ through the identifier structure.
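
The Chain Identifier scheme described above can be illustrated with a minimal sketch. It assumes word occurrences arrive as (word identifier, lemma) pairs in text order; the function name `build_chains` and the identifier format "S1.W1" are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_chains(occurrences):
    """Group word occurrences into chains keyed by a Chain Identifier.

    occurrences: list of (word_id, lemma) pairs in text order.
    The identifier format is an assumption made for this sketch.
    """
    by_lemma = defaultdict(list)
    for word_id, lemma in occurrences:
        by_lemma[lemma].append(word_id)
    # The Chain Identifier is the Word Identifier of the first occurrence.
    # Lemmas occurring only once form no chain.
    return {ids[0]: ids for ids in by_lemma.values() if len(ids) > 1}

occurrences = [
    ("S1.W1", "oil"), ("S1.W2", "price"),
    ("S2.W1", "oil"), ("S3.W4", "oil"), ("S3.W5", "gas"),
]
chains = build_chains(occurrences)
```

Because each identifier in this sketch embeds the sentence it belongs to, looking up a chain member also yields its nearest inner context without a separate index.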

Chain Length

The Chain Length is defined as the total number of members in the chain. For chains formed on lexical criteria the chain length is based on the words' lemma, and differs from Word Frequency (see Frequency Word Level).

Another reason for separating Chain Length from Word Frequency data is that a chain may be edited (by removing members, or by expanding the chain with members via the establishment of semantic relations between words, etc.). Members can for instance be removed if the member (identified by word identifier) appears in a sentence classified as Sentence Marginal. Members can be added so that the chain reflects semantic relations between words. In that case the chain length will be the sum of the chain lengths of the words for which a semantic relation is defined.

In many research reports of the prior art, long chains are claimed to reflect major topics in a text, and in addition chain length is used as a factor that contributes to a notion of ‘path strength’. However, long chains in their simplest form are nothing more than an interlinked list of high-frequency words, and have a very low discriminating ability with respect to the construction of options for text navigation or text exploration. It makes no sense to traverse, say, 550 occurrences of the word ‘Hydro’ (an oil company), because it is the surrounding words that give ‘meaning’ to the ‘Hydro’ occurrences. However, a chain interlinking ‘Hydro’ intersected by a chain interlinking ‘subject’ (grammatical function) will provide a substantial reduction.
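
The reduction obtained by intersecting a lexical chain with a grammatical chain can be sketched as follows. The identifiers and the helper name `intersect_chains` are hypothetical, and both chains are assumed to be lists of word identifiers.

```python
def intersect_chains(chain_a, chain_b):
    """Return the members of chain_a that are also members of chain_b,
    preserving the text order of chain_a."""
    members_b = set(chain_b)
    return [word_id for word_id in chain_a if word_id in members_b]

# Hypothetical chains: every occurrence of one word, and every word
# annotated with the syntactic function 'subject'.
hydro_chain = ["S1.W3", "S4.W1", "S7.W5", "S9.W1"]
subject_chain = ["S2.W1", "S4.W1", "S9.W1", "S10.W2"]

hydro_as_subject = intersect_chains(hydro_chain, subject_chain)
```

In this toy case the four occurrences are reduced to the two in which the word acts as subject.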

In accordance with the concept of inner context, the chain length (either initial chain length or the length of an edited chain) is input to the calculation of Zone Weight. The Zone Weight however also includes the factor of distance between chain members. Closeness between chain members gives a stronger indication of themes or grammatical/semantic features in certain areas of the text. Or, in other words, the chain members' distribution pattern combined with the notion of text zones supports the text exploration facilities in the present invention.

The user may of course choose to navigate through all the sentences containing members of a Chain. Or the user may activate an option that only highlights words that are members of a particular chain. She will then become aware of the occurrences and may at the same time select other options for the display of the high-frequency word's nearest neighbours, preferably within zones. The triple track can also preferably be used in order to examine the nearest neighbours of high-frequency words that are classified as focused words, preferably constrained to those annotated with the syntactical function ‘subject’ or ‘object’.

DBP Information Chain

Content in the DBP Information Chain includes references to content in other DBPs: {Chain Identifier, Word Feature, Chain Length, First Occurrence ID, Last Occurrence ID, {List of Chain Members}}. The List of Chain Members is a vector (inverted list) containing the identifiers of the word occurrences. The Word Identifier is constructed as a compound, i.e. the Sentence Identifier + Word Identifier + Word Relative Position (within sentence).
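
One possible in-memory shape for such a record is sketched below. The class name `ChainRecord`, the field names, and the representation of the compound Word Identifier as a tuple of three integers are assumptions for illustration; the source does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed compound identifier:
# (sentence identifier, word identifier, relative position in sentence)
WordId = Tuple[int, int, int]

@dataclass
class ChainRecord:
    chain_id: WordId      # Word Identifier of the first occurrence
    word_feature: str     # e.g. the lemma or a grammatical feature
    members: List[WordId] = field(default_factory=list)  # inverted list

    @property
    def chain_length(self) -> int:
        return len(self.members)

    @property
    def first_occurrence(self) -> WordId:
        return self.members[0]

    @property
    def last_occurrence(self) -> WordId:
        return self.members[-1]

record = ChainRecord(
    chain_id=(1, 17, 3),
    word_feature="lemma=oil",
    members=[(1, 17, 3), (2, 31, 1), (5, 80, 6)],
)
```

Chain Length and the first and last occurrence are derived from the member list here, so an edited chain cannot drift out of step with its stored length.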

The DBP Information Chain makes it possible to trace all the words signalling a theme and/or a feature in a text. The information is used when calculating Zone Density and the resulting file will yield information about patterns of how chains intersect each other within zones.

Intersection points that co-occur through the text may indicate main themes, and shifts in co-occurring intersection points may indicate thematic shifts. Chains that intersect these intersection points more ‘occasionally’ may indicate signs of a more fine-grained ‘aboutness of the text’, i.e. thematic nuances or other feature nuances.

Device Zone Bond Generation

The device for zone bond generation generates bonds, which are registered as links between the zones' link sets (long distance textual links).

An important technique for data reduction is to identify text zones, implying that some sentences are classified as more important than other sentences in the text. Besides, it is essential to take the user's requests into consideration; these are preferably pre-processed and adjust the generation of zone bonds accordingly. Persistent or regularly occurring user requests are stored and managed in the DBP User Profile.

The present invention generates zone bonds based on mainly two kinds of input: Information derived from the DBP Information Zone and the DBP User Profile. If the bonds are generated only according to the information registered in the consolidated zone link sets, the present invention generates a default intratextual path (Zone Traversal Path Default), which may be further connected to form intertextual paths.

If the bonds are generated according to information derived from the DBP User Profile or directly from a pre-processed User Request, the Zone Traversal Path is adjusted to the user's preference (Zone Traversal Path Adjusted). For instance, if 45 text zones are identified in a text based on the Zoning Criteria, the user will have the possibility to visit only, say, 10 of these zones that in some way match information given in the User Request.

Zone Bond

The concept of ‘Bond’ has a particular meaning in the present invention. Bonds are superordinate chains defined over zones or sentences that intersect several types of features as registered and assigned to the attribute types attached to the derived documental logical object type. The consolidated information is managed in DBP Information Zone. The nearest idea association is that of a ‘track’ or ‘furrow’ the user can follow when navigating through the text. If a user approves one of the generated Zone Bonds, she may store it as a part of her User Profile.

The concept of bond in the present invention is used differently from the use commonly seen in research literature, where the concept is often used to describe certain types of cohesive features existing between two adjacent sentences.

Identifier Zone Bond

The Zone Bonds are generated dynamically. Each bond has an associated internal identifier, which is the entry point to a vector keeping track of the set of zones interlinked in a traversal path.

DBP Information Zone Bond

A particular device constructs bonds between text zones. The threshold value for bond establishment is determined intratextually, and several bonds may intersect each zone.

Bonds may be determined in advance (pre-processing of texts based on word occurrences in the text zones).

Bonds may also be generated dynamically based on input from the user wanting to explore the text content (pre-processed based on user focused information).

A bond is defined as sentences that are interconnected to other sentences with at least an average connection score, preferably higher, and where the sentences are embedded in different zones (‘long-distance links’). The input to the device for bond generation is consequently the database partition containing zone information.
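
A minimal sketch of this definition follows, under stated assumptions: connection scores are held in a dictionary keyed by sentence pairs, each sentence maps to at most one zone, and the intratextual threshold is taken to be the arithmetic mean of all scores in the text. The names `bond_candidates`, `scores` and `zone_of` are hypothetical.

```python
from statistics import mean

def bond_candidates(scores, zone_of):
    """Return (zone_a, zone_b, s, t, score) for sentence pairs that reach
    at least the average connection score and lie in different zones."""
    threshold = mean(scores.values())  # assumed intratextual threshold
    bonds = []
    for (s, t), score in sorted(scores.items()):
        zone_s, zone_t = zone_of.get(s), zone_of.get(t)
        if zone_s is None or zone_t is None:
            continue  # sentence outside any zone
        if score >= threshold and zone_s != zone_t:
            bonds.append((zone_s, zone_t, s, t, score))
    return bonds

scores = {(1, 2): 4, (1, 9): 3, (2, 10): 1, (3, 9): 5, (9, 10): 2}
zone_of = {1: "Z1", 2: "Z1", 3: "Z1", 9: "Z2", 10: "Z2"}
bonds = bond_candidates(scores, zone_of)
```

The pair (1, 2) scores above the mean but is discarded because both sentences sit in the same zone; only long-distance links between zones survive.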

Zone Sensor

The concept Zone Sensor denotes a wide set of filter options that extract nouns from text zones (as registered in DBP Information Zone), arrange the nouns in order of frequency (or in order of first appearance or alphabetically), and/or in semantic classes (based on pre-specified criteria), and transmit the result for display in the text sounding board. The results are stored in an intermediate file, ‘Zone Sensor’, managed in the Top Layer of the MAFS. The Zone Sensor can accordingly operate on all feature sets registered and consolidated in the interconnected DBPs.

See section ‘Zonation Criteria’ and ‘Apparatus Filtering’.

Zonation Criteria

Zonation denotes the process aiming at identifying and marking the edges of text spans in which bundles of sentences are arranged in zones. The zones are derived documental logical object types, i.e. they do not ‘physically’ exist in concrete terms, but they are present and can be identified by combining multiple surface signals, grammatical information, and information related to discourse elements. The apparatus for zonation embodied in the present invention requires annotated text files (ATF), at least including the mark-up of sentences and POS-tags. Preferably the zonation procedure shall operate automatically and unsupervised, but this depends somewhat on the availability of foundational resources that must be balanced in a cost-benefit perspective. Whatever supplementary resources are available, the zonation is regulated by a clearly defined and applicable set of Zonation Criteria. The set incorporates a rather wide range of criteria that for the sake of clarity are grouped into three broad classes, here denoted as ‘Grammar based zonation criteria’, ‘Semantic zonation criteria’ and ‘Pragmatic zonation criteria’. On the other hand, many of the individual rules relate to more than one class; in particular they are all, to some extent, related to the pragmatic class. This is due to the stance that the present invention emphasises the attainment of practical approaches, yet in balance with the principles of text drivenness and the quality of the attention structures generated.

The three classes, each with subclasses, reflect initial concerns and spin-offs from empirical testing and validation. Some of the criteria assume background knowledge about the texts' domain, others relate to reflections on compositions of texts and variations with respect to text genres. The pragmatic approach is also influenced by experience in how texts are processed in relation to interpretative tasks such as investigations, inquiries, etc., i.e. working methods prevailing in information-intensive organisations. The pragmatic approach is also inspired by reflections on the etymological aspects related to the word ‘text’. ‘Texture’ can metaphorically be conceived as a ‘weave of meaning’ with closely interwoven constituents. The texture embodies items that are directly perceivable from the surface characteristics, and also items that are not as perceivable as lexical signals. The latter group ‘do exist’ and therefore can be registered and captured by computer systems, and further manipulated so that the ‘deeper structures of texture’ are made more explicit and where visualisation facilities can make them appear at the surface level.

The present invention embodies a device related to a set of reduction operations applied iteratively on the database partition (DBP) with information about each word occurrence in the texts. The device exploits the frequency information and grammatical information consolidated in these files, and the derived results are transmitted back to the next cycle of reduction operations. The final word lists are transmitted to a device for inclusion in the text sounding board, from which the user can explore the word sets organised according to the information exploited in the device for reduction.

Zonation Criteria Grammar Based

Zonation Criteria Grammatical are the criteria relating to words or the vocabulary of a language as distinguished from its grammar and construction, or relating to a lexicon or to lexicography. This class of criteria is further divided into two broad subclasses:

Criteria Lexical Frequency and Distribution

Information about each word's (canonical form and lemma form) frequency and distribution, both intratextually and intertextually, is applied in many of the devices embodied in the present invention. The ‘importance’ of a word (or phrase) contained in a text is determined by combining quantitative information (frequency, distribution, weight), grammatical information and semantic-pragmatic information.

Words that are classified as focused words, initially based on metric data, can be chosen to affect the reduction strategies and may be used in order to constrain the zone identification (intratextually).

Criteria Lexical Cohesion

The input file (ATF) contains information about each word's grammatical class and its lemma form. This information is utilised in the plain zonation procedure that registers and calculates similarities between each sentence and all the other sentences in the texts. The result is transmitted to a diagonalized matrix in which each connection point marks the score between each pair of sentences. The score is further adjusted and strengthened in accordance with many other criteria manifested as constraint rules.
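
The plain similarity calculation can be sketched as below. Two assumptions are made: the similarity between two sentences is taken to be the number of lemmas they share, and the ‘diagonalized matrix’ is read as a triangular structure that stores each sentence pair once. The function name `connection_scores` is hypothetical.

```python
from itertools import combinations

def connection_scores(sentences):
    """sentences: list of sets of lemmas, one set per sentence.
    Returns {(i, j): score} for i < j, keeping non-zero scores only."""
    scores = {}
    for (i, a), (j, b) in combinations(enumerate(sentences), 2):
        shared = len(a & b)  # assumed similarity: count of shared lemmas
        if shared:
            scores[(i, j)] = shared
    return scores

sentences = [
    {"oil", "price"},
    {"oil", "transport"},
    {"gas"},
    {"oil", "transport", "price"},
]
scores = connection_scores(sentences)
```

Each entry plays the role of a connection point; the constraint rules described in the following sections would then add weights to selected entries.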

Zonation Criteria Syntactic are the criteria relating to, or according to, the rules of syntax or syntactics. Syntax denotes the way in which linguistic elements (as words) are put together to form constituents (as in phrases or clauses). Syntax information manipulated by the devices in the present invention is first of all related to nouns and TAM (Tense and Modality) and information related to verbs' relative position in adjustable distances (‘open operands’) from nouns in the syntactic subject position in sentences. The present invention will preferably adopt advanced syntax patterns in sentences in accordance with results from genre specific tests and validations. The grammar-based patterns are applied iteratively in reduction strategies denoted as ‘filtering options’, being the building blocks directed by the apparatus for filtering. Testing and validation of advanced grammar patterns in various collections of genres is one of the most cost-driving factors.

The present invention relies on cyclic application of small building blocks transmitting the result to intermediary files that are examined and combined by particular devices, which in turn transmit the results to weighing procedures and in the end transmit the result to partitions in the text sounding board.

Criteria Lexical Chain

When a text is parsed by either a POS-tagger or a CG-tagger known in the prior art, it is possible to fully automate the generation of lexical chains that interconnect all repetitions of the same word type. It is preferred to apply a tagger that includes information about the words' lemma, which improves the zone identification procedure considerably. An automatically generated lexical chain does not reflect aspects of the semantic relations between words that occur throughout a text. It is well known that authors, seemingly dependent on genre, tend to avoid repetition by using a variety of noun phrases (among others) to refer to the same notion. The present invention applies a specialised target word selection procedure in order to identify semantic relations between words that are at a near distance from each other. The point is not to define or declare explicit semantic relations between words, but to strengthen the zone borders, if necessary.

The criteria adopted and (pragmatically) adjusted in the present invention are commonly described in literature within the field of text linguistics. The theory postulates the general assumption that bundles of repetitions indicate a form of thematic unit.

Lexical chains based on lexical cohesion do reflect some of the words' repetition patterns (a lexical chain is in fact a distribution plot manifested as an inverted list containing the words' identifiers). The length of lexical chains is of course dependent on the text length, i.e., for short texts such as e-mails, memos, notes, etc., it might not be worthwhile to generate lexical chains reflecting repetition. In this case the text will not have any text zones that are distinguished from other parts of the text either, if lexical cohesion features are the only criteria applied for the construction of attention structures. There will always be some kind of connections between sentences in short texts; therefore one option is to accept zone sizes of for example only two sentences, and add a weight to the zone with reference to the text length.

A particular device examines the distribution of chains within a text, and determines their intersection points. Text exploration is promoted when the user follows one lexical chain or combines several chains during text traversal. In fact the identification of text zones is based on the same principle, but refined according to a wider set of criteria including in particular semantic-pragmatic aspects as for example related to the notion of discourse element indicators. A text zone can in its basic form be considered as a sequence of sentences in pairs intersected by several chains.

If two zones share two or more chain intersection points, this indicates a Zone Bond candidate. This applies intratextually as well as intertextually (in the present invention with the assumption that one particular text is in some way related to other texts with reference to situational context). See Zone Density.

The present invention utilises information registered for the chains and the discontinuity of chains. A discontinuity is simply defined as a point in the chain where one or a small set of sentences contains no chain words included in the current chain.

A chain discontinuity may occur within a zone, in sentences in between the zone borders, or between zones. A discontinuity of a lexical chain can at the outset be considered as one sign of change in the author's focus of attention or a departure from the theme (referring to the notion that a sentence can be considered as a theme-opener or a theme-closer). If several chains share discontinuity points and other chains start at the same points, the indication of a move in the author's attention is stronger. In the case that a chain or a discontinuity ends at sentence S, OR begins at sentence (S+1), add the weight 1 to SCORE (S, S+1). Further detailed specifications are included in the program of the present embodiment of the invention.
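
One possible reading of this weighting rule is sketched below. It assumes each chain is given as the list of sentence numbers (1-based) containing its members, and it treats every point where a run of consecutive sentences starts or stops as an event that adds 1 to the pair (S, S+1). The function name `discontinuity_weights` is hypothetical, and the program of the embodiment may specify the rule differently.

```python
from collections import Counter

def discontinuity_weights(chains, n_sentences):
    """chains: {name: [sentence numbers containing chain members]}.
    Returns a Counter mapping (S, S+1) to the weight added there."""
    weights = Counter()
    for sentence_numbers in chains.values():
        present = set(sentence_numbers)
        for s in present:
            # The chain (or a run of it) ends at sentence s.
            if s + 1 <= n_sentences and s + 1 not in present:
                weights[(s, s + 1)] += 1
            # The chain (or a run of it) begins at sentence s.
            if s - 1 >= 1 and s - 1 not in present:
                weights[(s - 1, s)] += 1
    return weights

chains = {"oil": [1, 2, 3, 6, 7], "gas": [4, 5, 6]}
weights = discontinuity_weights(chains, n_sentences=8)
```

The pair (3, 4) receives weight 2 because one chain breaks off exactly where another begins, which corresponds to the stronger indication of a move in the author's attention described above.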

Criteria Syntax

If a word (noun) is in the Subject position in sentence ‘S’ and in the object position in sentence ‘S+1’ (or adjusted to distance 2 or 3 between sentences), or vice versa, the weight 1 is added to the score of the connection point between the two sentences.

Constraint: The sentences should be adjacent or near adjacent (distance max 3).
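
The rule and its distance constraint can be sketched as follows. The input layout (one dictionary per sentence with the noun lemmas found in subject and object position) and the name `subject_object_weights` are assumptions; a pair receives the weight 1 once, however many nouns match.

```python
def subject_object_weights(roles, max_distance=3):
    """roles: list of {"subject": set, "object": set}, one per sentence.
    Returns {(i, j): 1} for sentence pairs within max_distance where a
    noun is subject in one sentence and object in the other."""
    weights = {}
    for i in range(len(roles)):
        for j in range(i + 1, min(i + max_distance + 1, len(roles))):
            forward = roles[i]["subject"] & roles[j]["object"]
            backward = roles[i]["object"] & roles[j]["subject"]
            if forward or backward:
                weights[(i, j)] = 1
    return weights

roles = [
    {"subject": {"government"}, "object": {"oil"}},
    {"subject": {"oil"}, "object": {"price"}},
    {"subject": {"company"}, "object": {"gas"}},
    {"subject": {"price"}, "object": {"market"}},
    {"subject": {"market"}, "object": {"government"}},
]
weights = subject_object_weights(roles)
```

Sentences 0 and 4 share ‘government’ in subject and object position, but they are four sentences apart and therefore receive no weight under the constraint.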

The set of criteria covering syntactical functions is wide. The example above is included just to illustrate the idea of how syntactical information combined with information about the word occurrences can be utilised in order to strengthen the zone borders. Likewise such rules are applied in order to assign higher weights to zones in order to discriminate between zones with link sets defined over the same word occurrences. This simple rule reflects a general assumption that subjects and objects may be considered as more author-focused than other words. General knowledge about textual patterns is adopted and adjusted in several devices that tune the scores by intersecting zone link sets defined over various types of grammatical information. The devices for zone identification generate link sets for each type of information and these link sets are transmitted to the device that calculates the scores (see Device Score Connection Point Calculation).

As a simple example: In three governmental reports, words such as ‘oil’, ‘gas’, and ‘transport’ are highly frequent. If sentences number 10 to 40 in one of these reports all contain the word ‘oil’ in one of its grammatical forms, this would possibly yield an ‘oil-related’ zone provided that the sentences are linked by other lexical signals. Assuming that two of the link sets include the words {oil, transport}, then the link set for the pair of sentences in which the words are classified as subject or object gets a higher score. Likewise, if the word ‘oil’ re-occurs in either the subject position or object position in sentences 20 to 30, this embedded sub-zone will be given a higher weight assuming that it more strongly indicates that ‘oil is discussed’. If the sub-zone also includes important words, such as actors classified as important, preferably with reference to a user community, and several verbs are in the present form, this would likewise add weight to the zone.

Criteria Subject Omitted

If the subject is omitted in sentence S+1, add the weight 1 to SCORE (S, S+1).

If the subject is omitted in a sentence, this may indicate that the sentence is ‘dependent on’ the previous sentence.

Word List

Word lists intended to serve the purpose of precise diversification require, for each new domain and preferably also user community, procedures for manual intervention and validation. Each step involving manual intervention adds costs to the application; however, by persistent expansion and tuning with respect to domain and/or typical requests in a user community, the lists will add value to the applications as seen from the perspective of performance and benefit. Many years ago, it was predicted that the future would be characterised by ‘the organisation of knowledge specialists’. At present, in the year 2002, generally speaking, the greater part of information is mediated through documents (wide definition of documents with a diversity of enclosed object types due to digitalisation of all data types such as audio, video, etc.). The present invention will preferably install a programmed connection to a preferred embodiment of a special-purpose device supporting the need for manual intervention and validation in order to reach a higher degree of precision with respect to the words' discriminating value in the words' inner and outer context.

The device that generates word lists operates on DBP Information Word and outputs various types of word lists that are utilised by other devices for filtering purposes. The word lists can be reduced to include only words within particular types of sentences, particular zones with some shared features, etc. If the text structure is properly XML-tagged, the device can also construct word lists with words that occur in the first sentence of all paragraphs. A complete word list minus stop-words constitutes the free-text index. A grammar based word list is a list of words constituting one particular class, e.g. nouns or verbs. Grammar based lists are used in various consistency checks.

So-called ‘stop lists’ usually contain very high-frequency words. Modal verbs are typically very frequent in some text genres, and for instance sections with a high score for the form ‘shall’ may signal legislative texts, whereas the form ‘should’ more often signals an argumentative text zone. Similarly, high-frequency adverbs and adjectives characteristically may cause a modification of the meaning as interpreted by a human judge (during reading). For instance, a request pattern like [adjective followed by adjective with proximity=1] (adjacent words), where the first adjective is in its comparative form, may signal an assessment, no matter what the word occurrences are. If several word sequences match this pattern within a limited text span (a stretch of, say, 20 sentences), a procedure can locate this text zone and visualise (by the use of colours) this zone as an ‘indication of evaluation’. That is, many words that within the setting of Information Retrieval technology may be considered as ‘uninteresting words’ are highly interesting within the application area of the present invention.
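
The request pattern in this paragraph can be sketched as a scan over POS-tagged sentences. The tag names "ADJ" and "ADJ-CMP", the function names, and the choice of two hits as the meaning of ‘several’ are assumptions made for illustration; a real tagger will use its own tag set.

```python
def matches_pattern(tagged_sentence):
    """True if a comparative adjective is immediately followed by an
    adjective (proximity = 1). Tags are hypothetical."""
    return any(
        first[1] == "ADJ-CMP" and second[1].startswith("ADJ")
        for first, second in zip(tagged_sentence, tagged_sentence[1:])
    )

def evaluation_spans(tagged_sentences, window=20, min_hits=2):
    """Return (first, last) sentence indexes of spans in which at least
    min_hits matching sentences fall within `window` sentences."""
    hits = [i for i, s in enumerate(tagged_sentences) if matches_pattern(s)]
    spans = []
    for k in range(len(hits) - min_hits + 1):
        first, last = hits[k], hits[k + min_hits - 1]
        if last - first < window:
            spans.append((first, last))
    return spans

hit_a = [("better", "ADJ-CMP"), ("economic", "ADJ"), ("result", "NOUN")]
hit_b = [("weaker", "ADJ-CMP"), ("regional", "ADJ"), ("growth", "NOUN")]
plain = [("the", "DET"), ("report", "NOUN")]
tagged = [hit_a, plain, hit_b] + [plain] * 30 + [hit_a]
spans = evaluation_spans(tagged)
```

The two matches in sentences 0 and 2 form a candidate ‘indication of evaluation’ zone, while the isolated match in sentence 33 lies too far away to join it.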

Word Fan Structure

In a very reduced perspective, language use may be described as having two broad classes of words (especially as related to nouns and verbs, and lexical relations between nouns and verbs) in order to work efficiently. Zipf described the two broad classes as mirroring two competing forces in language use, that of unifying language use and that of diversifying language use. In the Zipf law of distribution, vocabulary balance occurs where language use contains a spectrum of words from the very general words of high frequency to the very specific words of low frequency, and a middle range of words that balance generality and specificity in varying levels.

For concatenations that have nouns as their constituents, the present invention embodies a device that generates fan structures superimposed on word sets organised along the general-specific dimension (explained below). The device splits the fan structure into frequency classes, and constructs links between words if they are related by lexical similarity between the components in concatenated words. The words classified as unifying language use are placed in the centre of link sets (forming unfolding fans in the text sounding board when displayed from the centre and then left and right). These are typically (in Norwegian) words of low or middle length, and the selection criteria are that the tagger has not classified them as concatenated and that they have a frequency above a certain threshold value determined intratextually. If the constituents of concatenated words are similar to words in the set of unifying words (centre words), the constituents are denoted as convergence words and linked to word types that are equal to the constituents. The link type is either <is a> or <aspect of>, depending on whether the constituents are the first part or last part of the concatenated word. Fan structures are generated intratextually, and subsets of encoded fan structures can be transferred to cover new texts if both sides (word types) in the fan structures are registered as occurring in the new text.

EXAMPLE

For example, a few adjacent sentences may contain the word ‘eieraksje’ and another ‘aksjeeier’ (with different meanings), and these sentences may also contain the very general words ‘eier’ and ‘aksje’. The structure generated will organise this word set along lines like (‘eieraksje’ <is a> ‘aksje’) and (‘eieraksje’ <aspect of> ‘eier’), (‘aksjeeier’ <is a> ‘eier’) and (‘aksjeeier’ <aspect of> ‘aksje’). The guiding principles for the design of these structures are, in addition to grammatical information, to divide the structures with reference to the specific words' relative frequency within each text, or the relative frequency consolidated across texts extracted from documents with defined interrelationships. These ‘fan structures’ are of high value for a user who enters a new text and wishes to explore the text from the general level to the more specific level. If the user selects the word ‘eier’ from one of the panes in the text sounding board, and the system has registered that this general word has a fan structure attached, the present invention preferably will embody the display of a button (with the icon of a ‘fan’). If the user chooses to activate this button, one set of more specific words will be unfolded to the left (the <aspect of> set), and another set unfolded to the right (the <is a> set). The user will preferably get an immediate impression of what the specific themes are as related to general themes. The present invention emphasises that the set of constraints exerting control over the generation of fan structures is according to the principle of text drivenness. Consolidation of fan structures across several texts will preferably depend on the relations between documents from a domain-specific document collection. The signals of diversification tend to get concealed if there is no control with respect to the relations between documents from which the texts are extracted. Without control the device for generation of fan structures will produce sets that probably will confuse the user, rather than inform her about possible themes. The particular device of the present invention aims at informing the user about content with reference to words organised in fan structures from general to specific with respect to the current texts that the user is exploring.
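
The link generation in this example can be sketched as below, assuming the tagger has already split each concatenated word into a first and a last constituent and that the set of centre words is known. The names `fan_links` and `unfold` are hypothetical, and the frequency classes and threshold described above are left out of the sketch.

```python
def fan_links(compounds, centre_words):
    """compounds: {word: (first_constituent, last_constituent)}.
    Returns (specific word, link type, centre word) triples."""
    links = []
    for word, (first, last) in compounds.items():
        if last in centre_words:
            links.append((word, "is a", last))
        if first in centre_words:
            links.append((word, "aspect of", first))
    return links

def unfold(centre, links):
    """Return the (left, right) sets unfolded for a centre word:
    <aspect of> words to the left, <is a> words to the right."""
    left = sorted(w for w, kind, c in links if c == centre and kind == "aspect of")
    right = sorted(w for w, kind, c in links if c == centre and kind == "is a")
    return left, right

compounds = {"eieraksje": ("eier", "aksje"), "aksjeeier": ("aksje", "eier")}
centre_words = {"eier", "aksje"}
links = fan_links(compounds, centre_words)
left, right = unfold("eier", links)
```

Selecting ‘eier’ unfolds ‘eieraksje’ on the <aspect of> side and ‘aksjeeier’ on the <is a> side, matching the four links listed in the example.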

So-called ‘specific words’ are often postulated as having a more precise meaning in that they have few relevant contexts. Agreement or disagreement regarding this issue depends on the definitions given for concepts such as ‘specific’, ‘meaning’, ‘relevant’, and ‘context’ (see section ‘The principle of text driven attention structures’). The present invention is founded on the convention in which ‘meaning’ is considered as an interpretative notion and that the interpretation of meaning differs with respect to what type of context the word appears in. More specifically, the present invention is founded on the differentiation between inner textual context, i.e., intratextually, outer textual context, i.e., intertextually, and situational context, i.e. affairs in the so-called real world outside the texts. According to the generally accepted understanding of vocabulary balance, as described by Zipf, the continuous range of words from highly general to highly specific corresponds to Zipf's distribution of word occurrences from high to low frequency. The device for generation of fan structures deals with grammar based aspects, which may be seen as a specialisation related to this ‘law of distribution’. In the advanced modus operandi of the text sounding board, the user can explore texts via the triple track. The individual yet interconnected panes in the triple track can be considered as a kind of ‘moving concordance regulated by underlying grammar patterns’. When words are activated in one of the panes, words not co-occurring with the selected word are subsequently removed from the two other panes. The size of the tracks (width) in the present invention is regulated according to word length, and in the preferred case, it will be possible to activate details about the words displayed. For example, if the word in one of the panes has a fan structure attached, a button with the icon of a fan will appear, and by activating the button the user can gradually become aware of specific theme signs in the underlying text.

Criteria Agent Process Object

One facet in the classification scheme (Subject Matter of Expressed Opinion) requires the identification of syntactic information such as Subject and Object. A preferred restriction is that both Subject and Object are nouns (if they are not nouns and are in the head of the sentence, these words are registered and utilised in a device that strengthens zone borders). By applying this rule, the number of sentences transmitted for further processing is substantially reduced. The main principle for the generation of the content displayed in the triple track is to make the user aware of details about words with respect to the words' nearest inner context. The tracks are denoted as Agent, Process and Object and give an attention structure that preferably will not cover all sentences in the text, but most preferably a subset of sentences within pre-identified text zones. However, if the user considers it advantageous to let the triple track cover all sentences conforming to the grammatical patterns underlying the triple track generation, she will be offered this option. In that case, the option will in some respects give the user a ‘total and at the same time reduced’ set of grammar based and grammatically organised entry points (contacts) to the text. The text sounding board is based on the principle of ‘zooming in’ and ‘zooming out’. This means that current constraints on the display in the triple track can be loosened, or further constrained, for example by activating discourse element indicators.

Criteria Anaphora

One factor that is often discussed in segmentation procedures known in the prior art is the problem of co-reference resolution and the management of anaphoric expressions. An anaphoric expression is defined as relating to anaphora, being a word or phrase that takes its reference from another word or phrase and especially from a preceding word or phrase (Webster dictionary, 1996). The problem is considered important in, for example, tools for text summarisation. The problem is to accurately determine the preceding words or phrases that the anaphora refers to, since this affects the determination of the sentences' significance with respect to the summarisation process.

The present invention does not consider anaphora and co-referencing as a problem in that the text zonation procedure operates at a different level of exploiting grammatical information encoded in texts. The various forms of grammatical substitutes are instead regarded as beneficial in the procedures that strengthen and condense the zone borders. Anaphora in the form of, for example, pronouns or adverbs of specific types occurring among the first few words in the first quartile of a sentence add a score of 1 to the ‘Score Connection Point’ assigned to the link set representing the current sentence and the preceding sentence. The pragmatic stance taken is that an anaphoric expression at this position refers to ‘something’ in the preceding sentence, but without trying to analyse or determine what this ‘something’ is. The present invention thus exploits anaphoric expressions by specific rules that ‘push’ the zone borders for inclusion of sentences starting with an anaphoric expression. The present invention does not make any attempts to identify the antecedent of anaphoric expressions. Since the sentence with the anaphoric expression is displayed in its inner textual context, the user will easily understand what the proper antecedent is.

In a similar way, a noun in the determinate form occurring in the sentence's first position is assumed to refer to a related noun, phrase or clause in the preceding sentence, but with no need to exactly determine the co-reference. This type of noun occurrence adds a score of 1 to the link set for the pair of adjacent sentences.

The present invention to some extent treats words with an anaphoric function, but in a rather pragmatic way as compared to technology for text summarisation. In case one of the types of anaphoric expressions appears in the head of the sentence (S+1) subsequent to the current sentence, the weight 1 is added to the score of the connection point between the pair of sentences, SCORE (S, S+1). See Device Connection Point Calculation.

The present invention aims at operating unsupervised and therefore only focuses on a few types of anaphoric relations between sentences. The problems with anaphoric expressions are not addressed by identifying what word or phrase in the preceding sentence the expression (grammatical substitute) refers to. Rather, if the anaphoric expression appears in the head of a sentence, the present invention assumes that it refers to the preceding sentence ‘in some way or the other’, and therefore simply adds a weight to the linkage score between the two sentences. (The ‘head of sentence’ is normally the first 1-5 words of the sentence, but this unit is calculated depending on the sentence length, including the intratextual average sentence length and standard deviation.)
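The anaphora rule described above can be illustrated with a minimal sketch. It is not the device itself: the word list `ANAPHORIC_WORDS`, the function names, and the simplified head length (a quarter of the sentence, clamped to 1-5 words, ignoring average length and standard deviation) are all assumptions made for illustration only.

```python
# Hypothetical list of anaphoric words; the text does not enumerate the types.
ANAPHORIC_WORDS = {"he", "she", "it", "they", "this", "that",
                   "these", "those", "there", "then"}


def head_length(sentence_len, max_head=5):
    """Assumed simplification: head = first quarter of the sentence, 1 to 5 words."""
    return max(1, min(max_head, sentence_len // 4))


def add_anaphora_scores(sentences, scores=None):
    """Add 1 to score(S, S+1) when the head of S+1 holds an anaphoric word.

    sentences: list of token lists, in text order (0-based indices).
    scores:    optional existing {(s, s + 1): score} mapping from other criteria.
    """
    scores = dict(scores or {})
    for i in range(1, len(sentences)):
        tokens = [t.lower() for t in sentences[i]]
        head = tokens[:head_length(len(tokens))]
        # No attempt is made to resolve the antecedent; only the link is scored.
        if any(t in ANAPHORIC_WORDS for t in head):
            scores[(i - 1, i)] = scores.get((i - 1, i), 0) + 1
    return scores
```

Note that the sketch, like the rule it illustrates, never resolves what the anaphoric word refers to; it only strengthens the link to the preceding sentence.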

Pronominal anaphoric expressions are easier to deal with if the nouns in the sentences are identified and checked against a ‘known list’ of important words (actors, actions, etc.). These words are identified and classified in the first processing round, and the lists of important words or known concepts are iteratively reapplied intratextually (a pronoun refers to ‘something’ within the text being processed). For instance, if the word ‘government’ is the last mentioned actor in a preceding sentence (and ‘government’ is in subject position), the pronoun ‘we’ may refer to the ‘government’. However, texts extracted from several document genres differ also in this respect. The word ‘we’ may as well signal a kind of general collective unit meaning ‘we all . . .’.

The present invention therefore treats pronominal references in the same manner as other types of referential expression, i.e., the linkage score between two adjacent sentences is given a weight depending on the referential expression's relative position within the sentence. The rules applied are highly pragmatic. For example, there is a rule for position based on whether the referential expression appears in the head of a sentence partition before the midpoint or in the head of a sentence partition after the midpoint. This rule is further dependent on the average sentence length and other constraints related to author-determined sentence partitions (clauses starting with a relative pronoun, commas, etc.). When the aim is to construct attention structures, it is not seen as necessary to tell the user exactly what word or phrase a referential expression refers to. Rather, these textual features are utilised in order to identify, determine and strengthen text zones by adding scores to pairs of sentences identified to have other types of connections (e.g., lexical cohesion).

Criteria Conjunction

In a sentence configuration with two adjacent sentences, a conjunction in the head of the second sentence does not imply or indicate an ‘obvious’ separation. The content of two sentences could as well be expressed in one sentence with a conjunction, for instance placed within the third quartile. Sentence structure is a sociocultural phenomenon and grammatical rules only apply within the sentence borders. Authors use conjunctions to join sentences or to join clauses within sentences. If the head of a sentence starts with a conjunction that by type refers to a precedent, the present invention adds a score of 1 to the connection point covering the two adjacent sentences. See the principles described in ‘Criteria Anaphora’.

Criteria Conjunction Sentence Head

This criterion applies in case a conjunction appears in the head of a sentence (S+1) that is adjacent to an antecedent sentence (S).

The concept ‘head of sentence’ is very imprecise. The present invention will preferably incorporate a device that determines the ‘head of sentence’ based on statistical sentence information covering each individual text (sentence length varies a lot across various genres, authors, etc.). If this device is not applied, ‘head of sentence’ with respect to conjunctions is restricted to cover the first 1-3 words. See also under ‘Sentence Quartile’.

The specifications for the device calculating and determining ‘head of sentence’ are based on results from an empirical investigation of two equal sized document collections covering text from different genres. It was found promising to apply information about average sentence length within each text as a means to determine the notion of ‘head of sentence’ more precisely. Based on statistical information derived from these two document collections, the following rules-of-thumb illustrate the pragmatic approach.

For example: If the sentence is of average length or above average length and the average length is more than 16, the head of the sentence is the first sentence quartile.

If the average length is 12 (or less), the head of the sentence is composed of the first and second quartile.
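The two rules-of-thumb above can be written as a small function. This is a sketch under stated assumptions: the function name, the rounding of the quartile size with `ceil`, and the fallback to the first three words for cases that neither rule covers are all illustrative choices, not taken from the description.

```python
import math


def head_of_sentence(tokens, avg_len):
    """Return the tokens treated as the 'head of sentence'.

    tokens:  the words of one sentence.
    avg_len: average sentence length (in words) of the text being processed.
    """
    n = len(tokens)
    quartile = math.ceil(n / 4)  # assumed rounding for the quartile size

    # Rule 1: long-sentence text and a sentence of average length or above.
    if avg_len > 16 and n >= avg_len:
        return tokens[:quartile]

    # Rule 2: short-sentence text, head spans the first and second quartile.
    if avg_len <= 12:
        return tokens[:2 * quartile]

    # Assumption: cases not covered by the two rules fall back to the
    # first 1-3 words, the default stated for conjunctions.
    return tokens[:min(3, n)]
```

For a 20-word sentence in a text averaging 18 words per sentence, the sketch returns the first five words; for an 8-word sentence in a text averaging 10 words, it returns the first four.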

Sentence separation is not always (or, rather, never) apparent. A conjunction located in the ‘head of the sentence’, preferably as determined by a device that considers the average sentence length in each text, is treated by increasing the weight for the link set covering the two adjacent sentences. The value 1 is added to the score of the connection point between the two sentences.

Constraint: The sentences should be adjacent (distance=0).

Note: Phrases including adjectives or adverbs referring to subsequent sentences are treated differently but by following the same principle. For example, if a sentence head contains ‘In the following’, this adds a score between this sentence and the first subsequent sentence. This is as opposed to, for example, a sentence starting with ‘further’, indicating that the content of the sentence is related to one (or possibly several) of the preceding sentences.

Criteria Conjunction Sub-Sentence Join

Conjunctions joining sub-sentences (clauses) may cause a tuning problem. One simple tuning technique is to compute the average sentence length for the sentences in a text or subtext. If a conjunction appears after the midpoint of a sentence of average length or above, the weight 1 may be added to the score between the sentence and the subsequent sentence, under the condition that the longer sentence already has some lexical cohesion with the subsequent sentence. The two sub-sentences may be considered as two ‘individual’ sentences (the use of separate sentences or the use of sub-sentences joined with a conjunction is often seen as rather arbitrary in a text, especially if the text is produced by several authors with various styles).

Collocation Combined

There is a vocabulary balance in texts, and the widely known Zipf (1945) claimed that the statistical regularities in language were the result of two competing forces of language use, that of unification and diversification. These features are examined in detail and as related to the words' grammatical classes in order to identify the set of words within each class that will provide useful information applied by the Apparatus for Filtering. In every text of a certain length, the theoretical and empirical studies underlying the present invention confirm that words in the grammatical classes of nouns and verbs may be grouped into two broad classes. These broad classes are known as general words with a low discriminating value and specific words with a higher discriminating value.

Prevalent words are words that language users relate to a wide range of propositional content and that may be viewed as having a kind of all-round feature set. These words differ in ‘meaning’ according to which context they appear in (intratextual appearance), and may also differ in ‘meaning nuance’ from one text zone to another in one text (a text of some length and where there is a certain distance between the zones). This calls for caution before such words are included in the lists of ‘focused words’ as realised in the present invention. When it is clear that the inner context of such common words affects the interpretation of them (human interpretation), it follows that these words have a low discriminating value. These words, however, unify language use in a text and are therefore a vital constituent in the zonation procedure, namely the part of the procedure that operates on grammatical lemma. Due to the low discriminating value of these words, the present invention employs frequency data combined with grammatical information in order to give more specificity to these general words. For example, collocations can be ordered according to frequent adjectives that precede nouns (with a distance up to 2 between them). These noun phrases (adjective followed by noun) often appear as concatenations (nouns) in neighbouring sentences, for example ‘lokal organisasjon’ (local organisation) and ‘lokalorganisasjon’ (concatenation with the constituents adjective+noun). By splitting the concatenations into their constituents, the resulting lexemes will strengthen the identification of text zones (giving a higher score to each connection point defined for each pair of sentences in the text).

For texts in Norwegian (and in many other languages), concatenated words are treated by a specially designed device again utilising combined collocations. Specifically, the examination of distribution patterns takes into account the distance between the occurrences within a text, preferably with well-designed constraints when crossing documents that do not share some features describing the documents' situational contexts.

Collocation Noun and Modifier

The words in chains change in various ways, and in particular with respect to language. In Norwegian, and many other languages, adjectives preceding nouns typically may signal changes in the ‘meaning’ of the noun.

For example, the noun ‘selskap’ (company) in governmental reports related to oil affairs is a rather high-frequency noun. The reports contained 950 occurrences of ‘selskap’ in different grammatical forms. The total word set was 110 337. This means that the word ‘selskap’ is of no practical value for the user wanting to traverse the text by activating this particular noun displayed in the text sounding board. The present invention therefore includes a device that calculates collocation patterns according to the words' grammatical class. Further, a particular device examines these collocation files in order to decide a set of grammatical request patterns that most likely will give the best performance in the identification of chain discontinuities caused by changes in the nouns' nearest modifier. To illustrate with reference to the example mentioned above: the combined collocation file, and in particular for adjectives to the left of the noun ‘company’, revealed several different adjectives that clearly modified the ‘meaning’ of company as an isolated term. It is worthwhile recalling that the concept of ‘meaning’ is taken as an interpretative notion, and consequently that the user's background knowledge influences this judgement.

The combined collocation file revealed that 58 of the 950 occurrences of the noun ‘selskap’ were modified by the adjective ‘nye’ (new) in the 3 positions to the left (out of which 49 in the first position). The immediate association is that this text in some way deals with ‘new companies’, and the situational context is about problems related to ‘oil company fusion’.

In order of total frequency, with the frequency to the left (position 1-3 left of the noun ‘selskap’) in parentheses (translated):

-   {1—new (58), 2—other (35), 3—norwegian (23), 4—foreign (13),    5—participating (9), 6—international (8), 7—all (11), 8—large (14),    9—competent (5), . . . }

The re-occurring modifiers mean that changes in the particular noun's modifier reflect a discontinuity in the lexical chain generated for ‘selskap’ (company).
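The counting of adjectives in the positions to the left of a noun can be sketched as follows. The input format (sentences as lists of `(lemma, tag)` pairs), the tag labels `"NOUN"` and `"ADJ"`, and the function name are assumptions for illustration; the actual tagger output and collocation file layout are not specified here.

```python
from collections import Counter


def left_modifiers(tagged_sentences, noun_lemma, window=3):
    """Count adjectives within `window` positions to the left of a noun.

    tagged_sentences: list of sentences, each a list of (lemma, tag) pairs.
    Returns (counts, first_position), where `first_position` counts only
    adjectives immediately preceding the noun.
    """
    counts = Counter()
    first_position = Counter()
    for sentence in tagged_sentences:
        for i, (lemma, tag) in enumerate(sentence):
            if lemma != noun_lemma or tag != "NOUN":
                continue
            for d in range(1, window + 1):
                j = i - d
                if j < 0:
                    break  # never cross the sentence border
                modifier, modifier_tag = sentence[j]
                if modifier_tag == "ADJ":
                    counts[modifier] += 1
                    if d == 1:
                        first_position[modifier] += 1
    return counts, first_position
```

Calling `counts.most_common()` on the result would give a ranked list comparable in form to the one shown above for ‘selskap’.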

The device controlling the zonation applies the following general rule:

The current sentence is S: If a lexical chain shows a feature of discontinuity caused by a modifier in sentence S+1 (the next sentence), this indicates a discontinuity or that the main chain has a local thematic variation that begins in S+1. Add the weight 1 to the score for the connection point between the two sentences (score (S, S+1)).

In a similar way, if the new thematic variation ends in the current sentence (S) or a discontinuity ends in the current sentence (S), add 1 to the score between the current sentence and the following adjacent sentence.

This simple rule strengthens the zone borders with respect to features such as continuity and discontinuity indicating shifts in thematic variations. This is the case, for example, if several chains end in the very same sentence and other chains start in the subsequent sentence.

Let us say that the current sentence is number 13 and that number 13 has a score of 2 as related to the preceding sentence number 12; the score set is {12, 13=2}. If 3 chains end in sentence number 13, the score between sentence number 13 and the subsequent sentence is increased by 3, giving {13, 14=3}. Further, if 2 new chains start in sentence number 14, the score between sentence number 14 and the preceding sentence is increased by 2. That is, the total score will be {13, 14=5}. Further, let us say that sentence number 14 has two chains intersected also in sentence number 15, giving {14, 15=2}. Thus the score for the sentence pair {13, 14} will demarcate the edges of two adjacent zones, one ending in sentence 13 and the other starting in sentence 14.
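The arithmetic of this worked example can be reproduced with a short sketch. It assumes that each lexical chain is available as a `(first_sentence, last_sentence)` span and that scores from other criteria are passed in as `base`; the function name and both input formats are hypothetical. The sketch only follows the counting in the example and does not handle a chain that ends in the last sentence of the text.

```python
from collections import defaultdict


def boundary_scores(chains, base=None, first_sentence=1):
    """Add 1 per chain end and per chain start to the adjacent sentence pair.

    chains: list of (start, end) sentence numbers, one span per lexical chain.
    base:   optional {(s, s + 1): score} mapping produced by other criteria.
    """
    scores = defaultdict(int)
    scores.update(base or {})
    for start, end in chains:
        # A chain ending in sentence `end` raises score(end, end + 1).
        scores[(end, end + 1)] += 1
        # A chain starting in sentence `start` raises score(start - 1, start),
        # unless `start` is the first sentence of the text.
        if start > first_sentence:
            scores[(start - 1, start)] += 1
    return dict(scores)


# The example above: three chains end in sentence 13, two start in sentence 14.
base = {(12, 13): 2, (14, 15): 2}
chains = [(9, 13), (10, 13), (11, 13), (14, 18), (14, 18)]
scores = boundary_scores(chains, base)
print(scores[(12, 13)], scores[(13, 14)], scores[(14, 15)])  # prints: 2 5 2
```

The chain spans other than the end in sentence 13 and the start in sentence 14 are invented so that the example is complete; only the three scores printed correspond to figures given in the text.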

The score just indicates that there is a shift in theme, and the content of the link set, in which the score is only one of the attribute types, includes the necessary information utilised by the devices that generate traversal paths, zone bonds, and so on.

Different rules apply for other variants of patterns stored in the combined collocation files. See also word relations organised in fan structures of <is-a> and <aspect-of> for Norwegian texts.

Collocation Proper Noun

The concept of ‘phrase’ refers to a word composition of two or several words corresponding to one of the predefined grammar based request patterns. A phrase refers to NLP phenomena, and in a particular user community ‘phrases of interest’ may be encoded in inventory lists, lists of persons, organisational divisions, etc. Several taggers known in the prior art seem not to be reliable with respect to the identification of proper names. Some of the problems seem to be related to composing proper name phrases with abbreviations. A preferred combination of proper name lists, proper name elements as recognised by the tagger, and grammar-based patterns superimposed on the combined collocations may capture more instances than if just exploiting the tags as they are. For example, the set of request patterns:

-   [(proper name<distance=0> proper name)] OR
-   [((proper name<distance=0> abbreviation)<distance=0> proper name)] will give phrases such as Øyvind Enger, Nina Raaum, Brit H. Aarskog, Persona N. Grata, etc.

Each set of request patterns is defined to have a certain search intention, in this case to locate and register named persons, organisations, etc. The lists generated from these request patterns are denoted as ‘phrases’, i.e., series of words in a text in accordance with commonly known grammatical patterns.
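The two request patterns for proper names can be illustrated by matching tag sequences over adjacent tokens (distance=0). The tag labels `"PROP"` and `"ABBR"`, the `(token, tag)` input format, and the function name are assumptions; real tagger output would need to be mapped onto them.

```python
def proper_name_phrases(tagged):
    """Collect phrases matching PROP PROP or PROP ABBR PROP on adjacent tokens.

    tagged: list of (token, tag) pairs for one sentence or text span.
    """
    phrases = []
    i = 0
    while i < len(tagged):
        tags = [tag for _, tag in tagged[i:i + 3]]
        if tags == ["PROP", "ABBR", "PROP"]:
            n = 3  # proper name, abbreviation, proper name
        elif tags[:2] == ["PROP", "PROP"]:
            n = 2  # proper name, proper name
        else:
            i += 1
            continue
        phrases.append(" ".join(token for token, _ in tagged[i:i + n]))
        i += n  # continue after the matched phrase
    return phrases
```

The longer pattern is tested first so that a name with a middle initial is registered as one phrase rather than being skipped.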

Criteria Noun Collocation

A multitude of research reports within linguistics propose algorithms in which the problem related to noun phrase co-reference resolution is considered as a clustering task. Many of the proposals depend on the application of general thesauri (specifically WordNet) where nouns are classified into broad semantic classes. The clustering approach is based on the generally accepted assumption that all noun phrases used to describe a specific ‘entity’ will be ‘near’ each other, that is, their distance will be small. The dependence on thesauri and semantic classification of words is considered a debatable constraint, in particular when the training corpus contains short texts.

The present invention does not seek to accomplish a solution to the co-reference problem as such, but incorporates the general assumption related to ‘nearness’ in a Device for the calculation of Noun Collocations (<is part of> Device Frequency/Grammatical Distribution Calculation). The noun collocations are further utilised in the Target Word Selection (TWS) Procedure.

The present invention manages a database partition (DBP) with Chain Information. This file includes a list of chain members (a vector), that is, pointers to the occurrences of each word type forming a lexical chain (lemma form) in the text. Word identifiers may be used as pointer values, or the pointers refer to word identifiers, which are entries to a separate DBP for Word Information. This DBP includes grammatical information, originally extracted from the annotated files produced by a POS-tagger or CG-tagger, and then consolidated due to efficiency issues. The attribute types attached to the documental logical object type Word (the set WATOT) realized in this particular DBP also include an attribute type for the potential inclusion of semantic codes (optional). Instead of processing all words with respect to all possible semantic relations between words near each other, the present invention provides for several reduction strategies. The DBP Word Information is transmitted to the device denoted ‘Device Frequency/Grammatical Distribution Calculation’, which produces combined collocations, for example collocations that only include nouns, or nouns and adjectives, i.e., information corresponding to a preferred set of grammar based search patterns.

For example, if the noun ‘eier’ (owner) co-occurs 120 times with ‘selskap’ (company) (total frequency of 870), in a distance up to 3 to the left or right, this regular co-occurrence will indicate that in this particular text there seems to exist a theme about owner as related to company. Further, if both these words are classified as focused words, i.e., author-focused words with certain frequency characteristics, the words are entered into a Target Word Selection (TWS) procedure. That is, ‘eier’ and ‘selskap’ are examined with respect to words in the neighbouring sentences in order to determine if there are any semantically related words (‘eier’ (owner)->{‘innehaver’}, ‘selskap’->{‘firma’, ‘konsern’, ‘bedrift’}). The validated matches are registered as ‘TWS Domain Code’ and each code refers to the word occurrences in the text initially forming the target word lists. This ensures that the semantic relations cover relations as captured intratextually, and that the relations are defined for words within a certain distance.
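The co-occurrence count in the example can be approximated with the sketch below. It assumes sentences are given as lists of lemmas and reads ‘distance up to 3’ as at most three word positions apart within the same sentence; that reading, the function name, and the choice to count occurrences of the first word are assumptions.

```python
def cooccurrence_count(sentences, word_a, word_b, max_distance=3):
    """Count occurrences of word_a with word_b at most max_distance positions away.

    sentences: list of lemma lists; sentence borders are never crossed.
    """
    count = 0
    for lemmas in sentences:
        positions_b = [i for i, lemma in enumerate(lemmas) if lemma == word_b]
        for i, lemma in enumerate(lemmas):
            if lemma != word_a:
                continue
            # Look to the left and to the right of the current occurrence.
            if any(0 < abs(i - j) <= max_distance for j in positions_b):
                count += 1
    return count
```

A count that is high relative to the total frequency of the two words would then mark the pair as a candidate for the TWS procedure.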

The set of ‘TWS Domain Codes’ is linked to the word occurrence Word Type (a separate file referring to the text identifier and a list of pointers to all word occurrences of a particular type). The gradually evolving domain specific thesaurus will thus become data independent in that it is possible to invoke that part of a thesaurus structure that covers one particular text, or a group of texts. The TWS Domain Code will be assigned to the field ‘semantic code’ in the DBP Word Information. First of all, these relations are used in order to strengthen the zone borders, i.e., the score for the connection point between two related sentences is increased. Secondly, the text sounding board will give an option for ‘expansion’. For example, if the word ‘selskap’ (company) is displayed in one of the panes in the text sounding board, the user may preferably select an option ‘display together with related words’. This causes the word occurrences of the type ‘konsern’ (concern), ‘firma’ (firm), and ‘bedrift’ (enterprise) to be highlighted as either co-occurring within zones, or generally for all occurrences of the whole set.

In the first run, the device for the calculation of noun collocations can preferably be constrained to operate on nouns occurring within sentences in a zone. Variants of Device Frequency/Grammatical Distribution Calculation operate on the different grammatical classes, i.e., grammatical request patterns are input in a kind of filtering operation. A lexical chain threading nouns and with the highest density value (see Zone Density) may be selected as the first ‘current noun’ and the procedure calculates the distance to other nouns forming a lexical chain passing through the same zone. The file Zone Density includes a Lexical Chain Identifier which in turn is the access point to another file containing the entries to all members of a chain (Word Identifier, which is composed of Sentence Identifier (intratextual serial number) and Word Position within the sentence). When the distance to all nouns forming a lexical chain passing the zone is calculated, the procedure calculates the distance to other nouns in the zone, if any (nouns that for some reason are excluded from the Chain Generation). In case the text has no zones (short texts) the procedure operates on sentences (see Sentence Density). The procedure can be constrained to only operate within a distance of a specified number of words, for instance 5 words, and for all grammatical classes or specified grammatical classes to the left or right of the ‘current noun’, within a sentence. It is of no interest to cross sentence borders.

The procedure is not dependent on thesauri resources, but will support the construction of intratextual thesauri structures. Recall that the present invention is founded on the idea or principle that word relations should be established intratextually since the word's context constrains the usability of general thesauri relations. However, this principle does not exclude the use of thesauri look-up. If nouns being members of lexical chains passing through the same zone (or near-by sentences) return with a distance value below a certain threshold value, and these nouns are listed in each other's synonymy lists in a thesaurus, the procedure will construct a semantic relation between these two nouns. In this case, the Zone Identification Unit has to check on the sentences adjacent to the identified zone borders as well as the sentences within the zone borders in order to recalculate the Zone Link Set.
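The constrained thesaurus look-up can be sketched as a filter over precomputed noun distances. The data structures (a dictionary of pairwise distances and a dictionary of synonym sets), the function name, and the threshold value are hypothetical; the sketch only shows the two conditions named above, a distance below a threshold and mutual listing in the synonymy lists.

```python
def validated_relations(distances, thesaurus, threshold):
    """Return noun pairs that are close in the text and mutual synonyms.

    distances: {(noun_a, noun_b): distance} for nouns in the same zone.
    thesaurus: {noun: set of synonyms}, e.g. loaded from a general thesaurus.
    """
    relations = set()
    for (a, b), distance in distances.items():
        mutual = b in thesaurus.get(a, set()) and a in thesaurus.get(b, set())
        if distance < threshold and mutual:
            # frozenset: the relation is treated as symmetric.
            relations.add(frozenset((a, b)))
    return relations
```

Pairs that are synonyms in the thesaurus but far apart in the text are deliberately left out, which reflects the intratextual constraint on thesaurus relations.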

The semantic relations will be registered in the vectors that are valid intratextually. The present invention serves the idea that such automatically established semantic relations should be calculated for each text. Since the procedure operates intratextually, the problem of semantic relations not being valid across several texts will be avoided. The vectors with semantic relations are however supposed to be of value within a domain-specific document collection, and perhaps especially for sets of documents that are task-related and in addition produced within a limited time span. If the texts are verified as sharing a specified set of features related to the situational context, the present invention does not oppose the consolidation of individual semantic nets across several texts.

The density and distance data can be used as input to a concept map generation unit (visualisation of concept relations). The procedure can operate incrementally by also including sentences (calculating noun collocations from the inside of zones towards the sentences in the vicinity of borders identified in the first run). The procedure can also operate on more advanced linguistic surface patterns in the form of compound noun phrases (possessive form, adjective preceding nouns, series of nouns, etc.). These linguistic surface patterns will be adopted from published research reports and tuned towards the purpose of the present invention.

The direction of processing may be influenced by language. It is for example reported that it may be advantageous to process English texts backwards by comparing each noun with preceding nouns within a certain distance. This is based on the simple assumption that any noun phrase possibly will refer to a preceding noun, either within the same sentence or in a preceding sentence. The present invention will preferably perform a test to see if concatenated nouns are processed more efficiently if the zonation procedure operates from end-of-file to start-of-file tags.

The preferred approach is to first include the words classified with respect to a notion of importance, which is definitely influenced by requirements in the user community (e.g. Word Signature). Furthermore, the approach considers the notion that some sentences most likely are more important or central than others, e.g. sentences with particular actors in the subject position. Nouns occurring within zones with a high density value (Zone Density, or alternatively Sentence Density) may also provide valuable information regarding the assumption that some words (nouns) are more important than others, i.e., they participate in stronger chains (noun chains intersecting several zones).

Zonation Criteria Semantic

A text zone consists of two or several sentences that are interlinked through lexical cohesion, grammatical information about word classes and word form, to a limited extent semantic information, and information indicating discourse elements. The realisation of semantic relations is dependent on the quality of either general thesauri or domain specific thesauri. A wide spectrum of recent reports (2002) reflects nuances in approaches aiming at establishing semantic relations between words, within and across documents in small to large collections. Several reports indicate that the automatic application of ‘concept space approaches’, ‘vocabulary networks’ and ‘general thesauri’ meant to cover several domains has not shown the results expected in terms of performance. The explanation for the rather low performance level announced in reviewed reports may relate to problems with the foundational theoretical framework. Several reports convey a rather weak understanding of the concept of contextuality with respect to text. There seems to be no differentiation between the text units' inner context, outer context, and situational contexts; at least this holds for approaches based on advanced mass-computation of statistical proximity measures. Many reports convey an over-optimistic belief that determining an external system of interconnections between words may provide for better precision in the users' search effort and reduce the problem of so-called information overload. The criteria subsumed under the class Semantic relate to ‘meaning’ in language use. The assignment of semantic relations between words is strictly governed by a set of constraint rules in the devices embodied in the present invention. An extensive encoding of semantic relations between words, without taking into account the difference between intratextuality and intertextuality, may cause a situation of ‘semantic overkill’.

Even a restrictive establishment of semantic relations requires access to thesaural resources, and preferably resources with a rather simple and controlled structure. The restrictive approach is performed and ruled by a particular device for target word selection. The device embodies a method and system for the construction of domain specific thesauri with subsets virtually attached to particular texts or texts extracted from documents that share some features from the situational context. Evolving domain specific thesaural resources add value to the applications of the invention in particular organisational settings. The persistent thesauri structures are never superimposed on texts without passing the device for target word selection. The value added will be in the form of less dependence on manual (computer-supported) intervention.

The set of constraints is primarily related to the assumption that semantic relations between words may hold intratextually, but not necessarily intertextually. There are several practical reasons for the restrictive perspective on the establishment of semantic relations between words occurring in a text relatively far from each other and in particular for words occurring in different texts. This is explained in more detail in the section about the principle underlying text drivenness. If one refers to the general meaning given to the concepts of precision and recall* (commonly used measures for the evaluation of IR systems' performance), an extensive assignment of semantic relations between words heavily increases recall. The application of search operators that activate semantic relations between words expands the search space accordingly (the operator OR carried out in a concept hierarchy or net). The user may evaluate the precision as more satisfactory when such semantic relations are activated; however, there is another important concern: that of the user's futility point. Higher precision may be of less value if the recall exceeds the user's capability (resources such as time) of investigating the number of elements in the result list. This effect is widely known as related to the ‘scaling problem’. Semantic relations between words conflict with the reduction strategies that form an important part of the present invention's ‘awareness’ perspective. *Measures such as precision and recall are undoubtedly of interest when evaluating the performance of IR systems in closed laboratory experiments, but are not the right goals to pursue in the perspective underlying the present invention.

Some text zones may be highly specific and of significance to users engaged in interpretative tasks. Zones with specific content can be ‘overruled’ by simply adding weights to connection points between sentences with assigned semantic relations. The old measure of ‘fallout’ refers to an essential aspect of all methods and systems related to making textual content available, either via search engines, or as in the present invention, as representative samples presented for exploration in a text sounding board. The term ‘representative’ by definition also embraces extraordinarily specific text spans. Extraordinariness must be treated as a certain type of feature reflecting how the author's focus of attention moves across a text. Therefore zones enclosing sentence pairs with a low score for connection points or zones with a low weight score, either on the lexical or semantic level or both, should not be neglected. Low scores can be treated as exceptional cases by simply providing for an option that displays zones registered as extraordinary.

A particular device applicable for many languages embodies a method for treating concatenated words and their constituents. The resulting structure is denoted as ‘Word Fan Structure’, in which the centre word is a focused word and the constituents can be unfolded on both sides via semantic-pragmatic relations denoted as <is a> and <aspect of> (not to be understood as absolute terms). In the present embodiment the prototypical version of this procedure operates on texts in Norwegian. The procedure can be adjusted for texts in other languages with concatenated words as one of their features; for English, combined collocations will preferably provide the necessary support (especially for nominal expressions, i.e., a word or word group functioning as a noun).

Criteria Synonymic Relation

It is commonly recommended to allow only one level of semantic relations as encoded in thesauri. For instance, if ‘company’ is related to ‘firm’ and ‘firm’ is related to ‘enterprise’ and ‘enterprise’ is related to ‘business’, only the relations from ‘company’ to ‘enterprise’ should be allowed for inclusion. That is, if a to b, and b to c, and c to d, then only a to c is allowed. If the user transmits a request with the term ‘company’, a system for automatic query expansion can include the interlinked terms, either as a category search or iteratively. The underlying assumption is that these links will refer to words supposed to have a related ‘meaning’. Automatic query expansion in this general sense, aiming at expanding the search span, is not an issue in the present invention. The reason for this is briefly explained in the section ‘The principle of text driven attention structures’.

In the present invention, to the extent general thesauri are applied, arelated kind of preferably automatic linking between different wordswill be applied with great caution and constrained by already identifiedtext zones. A zone identification procedure that solely operates onlexical items often results in zones covering too many sentences, i.e.they are too continuous either because series of sentence pairs have thesame connection score, or because enclosed sub-zones have weak borders.The use of lexical resources as general thesauri is therefore applied ina second round of zone identification, i.e., a round that involves zonestrengthening. The device for Target Word Selection (TWS) operates on asubset of the DBP Word Information, i.e. the set covering sentenceswithin zone borders (derived from the file Zone Border) plus a minorexpansion of 2-3 sentences adjacent to zone borders. If the TWSprocedure returns a validated TWS code, this code is assigned to theattribute type ‘Word Semantic Code’ which is an optional element in theDBP Word Information. The device for zone identification operates on therevised DBP resulting in a revised Zone Link Set in which the score forconnections points reflects validated relations between words insentences already identified as interconnected. For short texts withoutany zones, the distance can be constrained to a certain number ofsentences in a sequence. See Criteria Noun Collocation.

Semantic relations between words will be realised, first of all, intratextually and under the constraint that the words must appear ‘near’ each other up to a certain distance, and preferably within zones reflecting plain lexical cohesion. The resulting ‘semantic nets’, as encoded in a thesaurus structure covering each text individually and to a certain extent consolidated to cover related texts, materialise the underlying conception of ‘context’ as different with respect to ‘inner context’, ‘outer context’, and ‘situational context’.

The present invention incorporates a detailed classification scheme describing semantic classes of words or phrases in a system organised as facets. The facet structure is designed with reference to speech act theory, juridical norm theory, workflow considerations, and grammatical constructs. The classification scheme is elaborated in Aarskog (1999). The particular scheme also conceptualises the communicative aspects of the document's situational context.

Criteria Verb Relation

The assignment of ‘semantic codes’ reflecting identified relations between words in the text is preferably restricted to nouns or certain word constellations identified via the device that produces combined collocations (see Device Frequency/Grammatical Distribution Calculation). The construction of semantic relations between verbs is more debatable due to many factors. The ‘meaning’ of a verb is dependent on which arguments the verb type takes, and is also influenced by tense and modality (TAM). The arguments ‘attached’ to the same verb type or semantically related verb types will differ from one sentence to the next. In addition, the ‘meaning’ of verbs differs with respect to the close inner context, depending on whether they are used to express acts or opinions. In the latter case, the occurrences of the same verb type may differ with reference to qualification criteria such as sincerity condition, modality, sentence structure, and so on.

Given thesauri resources that explicate semantic relations between verbs, the target word selection procedure applied in the present invention rests on very restrictive rules with respect to the assignment of semantic relations between verbs in the text. Instead, the present invention relies on a text driven construction of broad semantic classes, and only if the verbs appear within the same zone, or between zones and near-by adjacent sentences outside the zone borders. The rule in addition takes into account whether ‘near-by’ words are substantivized verbs, i.e. verb types that have a correspondent noun type within a short distance, see Criteria Synonymic Relation.

This implies that semantic relations can be defined across grammatical word classes if verbs and corresponding nouns appear within a short distance. The close examination of the set of governmental reports showed that this particular feature of language use was a regular pattern (Norwegian texts). Rules for the establishment of relations between verbs depend on language. Regarding for instance the Scandinavian languages, the adverbial particles (among others) have to be taken into account when representing verbal phrases and distinguishing between them.

In short: The ‘meaning’ of words in their textual context is diverse and cannot be fully disambiguated even through an advanced dictionary look-up. The underlying assumption is that the constant renewal of language and language use within ‘contexts enclosing contexts’ is a problem that cannot adequately be dealt with by the unsupervised application of thesauri.

Verbs are Highly Polysemous

Verbs may signify a complexity of information. Therefore verbs are probably the lexical category that is most difficult to exploit with respect to the generation of attention structures. The present invention applies a theoretically founded, yet pragmatic approach regarding these issues.

In the present invention the division into broad semantic classes of verbs is based on speech act theory, and the invention relies on the definition of a set of broad semantic verb classes due to the fact that verbs are highly polysemous. Verbs may change their meaning completely depending on the kind of noun arguments with which they co-occur. In addition, the verbs' positions within the sentence also influence their meaning. Given the existing state of the art, the present invention omits the application of fine-grained semantic verb classes as inscribed in general thesauri such as WordNet. The broad semantic classes of verbs are preferably formed for verbs in a position subsequent to nouns that occur in the subject position. One set of criteria applied for the division into classes is the five primitives known from classical speech act theory, i.e., assertive, commissive, directive, declarative, and expressive illocutionary forces, which, when expressed in explicit forms subsequent to nouns, are the simplest illocutionary forces of utterances. Other criteria relate to physical action; however, care must be taken because such verbs are often used in a metaphorical sense in argumentative texts. In the present invention the most important sets of criteria for content words (grammatical classes of nouns, adjectives, verbs, and adverbs) are those related to the domain and genre covered by the document collection.

A particular device will deal with concatenated verbs (first constituent being a noun or preposition) and verbs combined with adverbial particles. For example in Norwegian, (‘innkalle’ <same as> ‘kalle inn’), (‘sammenkalle’ <same as> ‘kalle sammen’), and (‘bygge ut’ <different from> ‘bygge ned’). This device will adapt results from linguistic research within the EU research programs. (Norway is not a member of the EU, meaning that the results referred to are related to languages such as Danish, which is very similar to Norwegian, and German; English has another feature set for verbs.)
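
The equivalences listed above can be illustrated with a minimal Python sketch. It assumes a small hand-written particle list (`PARTICLES`) and a hypothetical helper `normalise`; neither is part of the described device, which would rely on a grammar tagger that marks concatenations rather than on naive prefix matching (the sketch would, for instance, wrongly split any verb that merely happens to begin with a particle).

```python
# Hypothetical particle list for illustration only; a real system would
# take this information from a grammar tagger.
PARTICLES = ("inn", "sammen", "ut", "ned")

def normalise(verb_phrase):
    """Map 'innkalle' and 'kalle inn' to the same (verb, particle) pair."""
    parts = verb_phrase.lower().split()
    # Separated form: verb followed by an adverbial particle.
    if len(parts) == 2 and parts[1] in PARTICLES:
        return (parts[0], parts[1])
    # Concatenated form: particle prefixed to the verb (longest prefix first).
    if len(parts) == 1:
        for particle in sorted(PARTICLES, key=len, reverse=True):
            if parts[0].startswith(particle) and len(parts[0]) > len(particle):
                return (parts[0][len(particle):], particle)
    return (verb_phrase.lower(), None)

print(normalise("innkalle") == normalise("kalle inn"))
print(normalise("bygge ut") == normalise("bygge ned"))
```

Two verb phrases are treated as <same as> when their normalised pairs are equal, and as <different from> when the verb is shared but the particle differs.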

As mentioned, several of the devices in the present invention are founded on theories that at present may be subsumed under the field of ‘Integrational Study of Language and Communication’, established as an international association in the year 2000. The theoretical framework underlies a series of small-scale empirical investigations whose results have influenced the cycles of design, construction, experimentation, redesign, and reconstruction, i.e., experimental design.

The following brief presentation gives a simple example for the purposeof illustration. One particular investigation examined the detailsbehind the distribution pattern of verbs in two equal-sized (number ofwords) collections of documents, the genres of laws and governmentalreports (in Norwegian). 5 663 different verbs were detected, of which 2514 verbs were present in the reports but not in the law collection. Thevery high-frequent verbs were typically general verbs, and by studyingthe occurrences in their inner context, they also typically served asunifying the text (as claimed by Zipf). An in-depth analysis of themedium-to low frequent verbs, resulted in a set of 1 124 verbs thatseemed to serve the purpose of diversification, and of course based onsubjective evaluation (errors such as misspellings removed, etc.).Further, the very high-frequent verbs also showed to be the mostfrequent constituents of concatenated verbs, and in which the firstconstituent was a noun. Furthermore and with respect to the innercontexts, these nouns (classified as general nouns) also appeared inneighbouring area of the verbs, either as individual words or asconstituents in concatenated nouns. Patterns were also revealed forverbs added with an adverbial particle (preposition) and nouns with aprefix similar to the adverbial particle, and the constituent beingprefixed similar to the verb. Information derived from experimentalobservations is realised as guiding rules in the apparatus for zonation.That is, since the patterns are relevant for the comparison of pairs ofsentences, it is worthwhile to construct grammar based search macrosthat capture these similarities. The particular devices capturing theseinterweaving features are of course dependent on underlying filesencoded with grammatical information, and preferably grammar taggersthat mark concatenations. The rule set positively affects the zonationprocedure. 
Among the 1 124 different verbs found to be of potentialvalue for diversification, about 80% of them occurred two or severaltimes in the subsequent position after nouns occurring in a subjectposition. A review of this new subset of verbs seemed to give an‘immediate sense’ of understanding, i.e. they evoked an association(subjective judgement) when seen together with the nouns in question(nouns as subjects). The next step was to further reduce, or diversify,the set of verbs by including information about their grammatical form.In argumentative texts, such as governmental reports, verbs in thepresent may sign the logical now, that is the time of the utterancemediated through text. The patterns revealed were promising, in thatthis step-wise reduction from the initial set of 5 563 different verbs,now showed a set of verbs (in the present) that to a surprising degreereflected opinions, reflections, evaluations, viewpoints, concerns,standpoints, etc.

Within the scope of the present invention it is important to recall that argumentative texts, and many other genres, are an interpretative medium that gives other data (statistics, figures, maps, pictures, etc.) ‘meaning’ within the situational context (explaining the reason for the production of the document, or the document's background information). A specially designed device constructs collocation patterns that exploit frequency information combined with grammatical information as encoded in the DBP Word Information. As for verbs, useful patterns related to the three other main grammatical classes gradually evolved through the iterative application of more or less the same reduction operations. Some of these patterns are realised in particular devices seen in the present embodiment of the invention. The adjustment of these devices will be conditioned by results from these iterative applications of reduction operations. These operate on information about the words' frequency and distribution patterns, combined with information about the words' grammatical class and grammatical form, as well as the same information about the subsequent and preceding words.

With reference to the example and description above, it is mentioned that the set of verbs subsequent to nouns can be further restricted according to tense and modality (TAM). Verbs signalling utterances (the expression of viewpoints, meanings, thoughts, appraisals, etc.) are for example of particular interest in argumentative texts if the preceding noun refers to a specific actor or group of actors (juridical persons, either denoted individually by proper nouns, or denoted collectively). These filtering options follow modern theory within the field of text linguistics (a field proclaimed to differ significantly from orthodox linguistics, formal semantics, computational linguistics, etc.). Regarding modality, the present invention follows the theory on illocutionary force as signed by restricting the mode of achievement of the illocutionary point by imposing a new special mode.

For example, Werth (1999) gives a list of modals and groups them according to tense, thus proposing a continuum of the utterance's intentional strength (probability, possibility, and predictability). Bundles of the same TAM across several adjacent or near adjacent sentences may indicate one aspect of the deeper semantics in a text zone. Taken together with cue phrases indicating the superordinate rhetorical or argumentative function in zones, and preferably in conjunction with nouns referring to mediators (nouns in the subject position), the present invention provides indicators of deeper semantics that may provide useful attention structures for professional text explorers. The theory and approach grounding the present invention state that frequency information cannot tell anything about the deeper semantics, for instance the types of utterances or types of arguments (whether the issues are moralistic, ethical, juridical, etc.). This statement applies regardless of the complexity of the computational statistical-proximity models (see the section ‘The principle of text driven attention structures’).

The present invention applies grammar based request patterns which, in addition to the words' grammatical class, may also incorporate information about the grammatical form. For instance, it shows that texts fluctuate from verbs in the past to verbs in the present. An author may typically refer to ‘something’ in the past and thereby evoke the reader's cognition about the background situation. A new stretch with bundles of verbs in the present may indicate a shift in the author's focus to the ‘matter dealt with’ or to opinions about ‘what is’ in the logical now time zone (logical now being the author's date of utterance).
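
As a rough illustration of how stretches of the same tense could be located, the following Python sketch groups consecutive sentences by a single tense label. The input format (one tense label per sentence), the function name `tense_runs` and the `min_length` parameter are assumptions made for this example; the text does not specify how the dominant tense of a sentence is determined.

```python
from itertools import groupby

def tense_runs(sentence_tenses, min_length=2):
    """Return (tense, first_index, last_index) for each run of equal labels
    that spans at least `min_length` consecutive sentences."""
    runs = []
    start = 0
    for tense, group in groupby(sentence_tenses):
        length = len(list(group))
        if length >= min_length:
            runs.append((tense, start, start + length - 1))
        start += length
    return runs

# One assumed tense label per sentence, in text order.
labels = ["past", "past", "present", "present", "present", "past"]
print(tense_runs(labels))
```

A run of ‘present’ labels following a run of ‘past’ labels would be a candidate indicator of the shift in focus described above; it remains an indicator only.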

Zonation Criteria Pragmatic

Zonation Criteria Pragmatic are the criteria that shall embody matters of practical affairs as seen by a user community. Elements in some discourse models may be identified through ‘important words’ or special cue phrases being lexical signals for such elements. The use of pragmatic criteria requires a list of words recognised and validated as reliable cue phrases. A specially designed classification scheme includes a set of categories (with further refinement into facets) for the assignment of codes referring to text zones in which ‘these sentences deal with a problem’, ‘these sentences deal with an evaluation of a solution or comparison of solution proposals’, etc.

The present invention considers information about frequency and distribution as supportive for the diversity of weighing functions that in most of the devices are combined with grammatical information about the words in the texts selected by the user for exploration.

The frequency of word form provides useful information about the themes in a document, or the thematic profile. The nouns and noun phrases are signals of the text's themes and sub-themes. In the present invention nouns are grouped according to their co-occurrence patterns in adjacent or near adjacent sentences, and weighed according to the syntactical function and position within the sentences. This provides for a thematic profile of text zones where text zones are conceived as derived documental logical object types. Preferably nouns, above and below certain threshold frequency values, are classified into broad semantic classes in which both the exhaustivity and specificity of each class are according to criteria within the user community.

This is according to the stated intention that each application of the invention is tuned towards specific user communities or several user communities with shared features as related to professional domain, tasks at hand involving text exploration, sizes of domain specific document collections, persistent goal-oriented information needs, etc. Threshold values for the noun frequency are determined intratextually (within one text at a time), and balanced or weighed against the text length, number of nouns in the text, and number of nouns in the identified set of text zones. The particular device that performs these calculations, again with reference to the words' grammatical information, produces density values.

The intratextual perspective is based on the conception of each text as a weave of meaning where the text provides the inner context. Each text (and thereby words within the text) has an outer textual context if the text is extracted from a document that is workflow related to other documents in the collection. Texts extracted from several documents that are related (as seen by the user) can be treated as a whole, thus involving intertextual processing. Documents that are related within the domain are seen to provide one aspect of the situational context, that is, aspects of the situation or event in the world outside the text that caused or motivated the document to be produced and mediated.

The present invention embodies several filtering options for an exploration facility denoted as ‘incremental aboutness’, and these options are preferably also tuned according to requirements that reflect specific relations between users, tasks and documents.

Device Word Classification

The ‘DBP Word Information’ is used when calculating word adjacency matrices with respect to certain grammatical patterns. Word adjacency information is applied in the identification of re-occurring word constellations as well as variants in language use (collocation patterns). The important issue is that frequently occurring collocation patterns support the apparatus for disambiguation. Collocations with a low frequency may indicate either ‘marginal’ expressed opinions or ‘central’ opinions. In the latter case, the collocation is intersected by words classified as Word Important, for instance cue phrases indicating problems, decisions, solution proposals, etc. (i.e., lexical signals referring to discourse elements or superordinate argumentative functions).

The application of the device in particular addresses the focused word, which may be conceived as a representative of focused themes in the text. The concept ‘focused word’ must not be confused with other terms such as ‘core term’. A focused concept preferably refers to a noun or phrase (i.e., a word constellation which conforms to a predefined grammar based request pattern) occurring above a certain threshold value for a given text. The value is determined intratextually (or intertextually for a small set of texts if they are workflow related), and the occurrences can also be constrained to appear within text zones (occurring in several adjacent sentences identified by a zonation procedure that also takes into account variations in wordings). The focused word is seen to reflect ‘author-focused information’, and the idea is to lead the users' (readers') attention to these. Notions like ‘core theme’ or ‘important word’ are rather diffuse, because what will obviously be an important word or core theme to one user need not be regarded as such by another user. Since the device applies a text-driven approach (leading attention to what actually occurs in the text), users are free to select from a displayed list of focused words according to their own perception of what is essential in the specific information-seeking situation.

The device for word classification produces information utilised in theconstruction of specific-purpose thesauri with arrangements of‘important words’ or ‘cue phrases’. The concept ‘cue phrases’, here,refers to for example explicit linguistic signs of discourse elements(mapped onto a specific set of criteria applied during theidentification of particular text zones), for instance words signallingor indicating that utterances have specific superordinate argumentativefunctions, (see Device Zone Sensor). The classification of cue phrasesnecessitates manual intervention and validation, and the effortsinvested into this type of thesauri construction depend on the issue ofCost-Benefit-Performance within each user community. However, generaland explicit words signalling problems, conflicts, risks, etc., aretypically abstract words being of value in filter options applicableacross domains.

Cue phrases systematised with reference to discourse elements can be associated with notions such as ‘classes of meanings’. The device for target word selection can be applied under the assumption that ‘classes of meanings’, as related to cue phrases, may also tend to repeat in neighbouring sentences. For example, the very general word ‘problem’ is an entry in a general thesaurus. This word is picked up as the current target word, and the TWS device examines all the link sets for zones and displays all content words in the adjacent sentences. The set of words displayed can be further restricted to the words in the same grammatical class as the entry word selected from the general thesaurus (i.e., the Current Target Word), and/or combined with information about lemma or word stems. This particular device thus supports the needed manual intervention and validation reflecting the words' inner and outer context. The identification of cue phrases reflecting argumentative functions or discourse elements may also be based on matching according to grammar based request patterns.

EXAMPLE

A simple grammatical request pattern such as (adjective followed by noun) will produce ‘statlig selskap’ as one of its results (state company, i.e., two nouns in English). In the collocation patterns covering 23 governmental reports related to oil affairs (when expanded with grammatical annotations and filtered accordingly), the file shows that ‘statlig’ co-occurs 117 times in the first position to the left of ‘selskap’. Since the grammatical word classes in ‘statlig selskap’ conform to a known simple grammar pattern, and due to the co-occurrence frequency, ‘statlig selskap’ will be determined to be a phrase (and treated as a ‘single term’, that is, a contact to the underlying text). Thus the combined use of grammatical information and frequency of proximity information increases the precision regarding the determination of focused words or word constellations in the texts.
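
The combination of a grammar pattern with co-occurrence frequency can be sketched in Python as follows. The tagged-sentence input format, the tag names `ADJ` and `NOUN`, the function names and the threshold `min_count` are all assumptions made for this illustration; the text does not state the threshold at which a word pair is accepted as a phrase.

```python
from collections import Counter

def count_pattern(sentences, first_pos="ADJ", second_pos="NOUN"):
    """Count word pairs in which a `first_pos` word directly precedes
    a `second_pos` word. Each sentence is a list of (word, pos) pairs."""
    counts = Counter()
    for sent in sentences:
        for (w1, p1), (w2, p2) in zip(sent, sent[1:]):
            if p1 == first_pos and p2 == second_pos:
                counts[(w1.lower(), w2.lower())] += 1
    return counts

def phrases(counts, min_count):
    """Keep only the pairs whose frequency reaches the assumed threshold."""
    return {pair for pair, n in counts.items() if n >= min_count}

# Tiny hand-tagged sample standing in for grammatically annotated text.
sample = [
    [("Et", "DET"), ("statlig", "ADJ"), ("selskap", "NOUN")],
    [("Statlig", "ADJ"), ("selskap", "NOUN"), ("eier", "VERB")],
    [("nytt", "ADJ"), ("selskap", "NOUN")],
]
counts = count_pattern(sample)
print(phrases(counts, min_count=2))
```

In this toy sample only ‘statlig selskap’ reaches the threshold, so it alone would be treated as a single term.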

In the Norwegian language ‘statsselskap’ may be considered a near-synonym of ‘statlig selskap’ (a paraphrase). A grammar parser will recognise ‘statsselskap’ as a concatenated word (‘sammensetning’) with the two lemmas ‘stat’ (noun) and ‘selskap’ (noun). A stemming of the phrase ‘statlig selskap’ will produce the set ‘stat’ and ‘selskap’. Thus, if one sentence contains the phrase ‘statlig selskap’ and a nearby sentence (distance can be regulated) contains the word ‘statsselskap’, the zonation procedure will capture these and add a score to the link set covering the pair of sentences. Other variations that may occur close by are, for example, ‘statens selskaper’ and ‘selskapene til staten’, which will also be captured by iterative applications of grammar based request patterns.
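
A minimal Python sketch of this matching step is given below. The lookup table `DECOMP` stands in for the output of a grammar parser and stemmer, and the scoring rule (one point per shared focused lemma) is an assumption for illustration; the actual scoring of link sets is not specified at this level of detail.

```python
def lemma_set(tokens, decomposition):
    """Expand each token into its constituent lemmas via an assumed lookup."""
    lemmas = set()
    for tok in tokens:
        lemmas.update(decomposition.get(tok.lower(), [tok.lower()]))
    return lemmas

def link_score(sent_a, sent_b, decomposition, focus):
    """Assumed rule: one point per focused lemma shared by both sentences."""
    shared = lemma_set(sent_a, decomposition) & lemma_set(sent_b, decomposition)
    return len(shared & focus)

# Hypothetical parser/stemmer output for the word forms in the example.
DECOMP = {
    "statsselskap": ["stat", "selskap"],
    "statlig": ["stat"],
    "statens": ["stat"],
    "selskaper": ["selskap"],
}

s1 = ["Et", "statlig", "selskap", "ble", "opprettet"]
s2 = ["Et", "statsselskap", "fikk", "ansvaret"]
print(link_score(s1, s2, DECOMP, focus={"stat", "selskap"}))
```

Restricting the score to focused lemmas keeps function words such as ‘et’ from contributing to the link set.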

Small variations in ‘meaning’ are ignored when the goal is to locate or detect a relatively small set of documents out of a collection comprising millions of documents. In the present invention the goal is quite different: the variations in wordings are captured in order to strengthen the score (weight) assigned to certain text zones within each document. The text zones are a fundamental component in the method aiming at directing the users' attention to zones where the link set for the constituent sentences indicates a bundle of focused words or several co-occurring focused words. More specifically, the idea is: When the user selects a document for exploration, a text sounding board will signal information about zones in the document. With reference to the highly simplified example above, the board will inform the user that the current text (selected by the user and displayed in the text pane) contains 7 identified zones dealing with [stat/statlig <associated with> selskap], preferably with indicators for discourse elements attached to the entry.

In the example below, let us say that the user has selected the entries marked in boldface (Thematic association AND Discourse Element Indicator). The zones satisfying both criteria will be highlighted in the text pane. The criteria are grammar based, based on frequency (link set), and based on semantic-pragmatic information (zones with several adjectives in the comparative form signal evaluations). HCI factors are not determined in the present embodiment of the invention. However, text zones are highlighted in light grey, the words signalling the selected thematic association (focused topics) are marked in blue, and the signals for discourse elements are marked in another colour (green).

The reader can explore these zones and judge whether they provide information useful for the task at hand, and further activate an option for navigating to the next zone with the same focused theme, or shift to another focused theme. The display of zone information can be adjusted to show text zones with the highest weight (score in link sets covering the zone and zone tuned against density measures) or to display the zones in the order they appear in the text. Each zone has its own identifier (identifiers of edge sentences, thus making it possible to manage overlapping zones) and the traversal paths are defined over these identifiers (stored in separate files in the MAFS).
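
The two display orders can be sketched in Python as follows. The `Zone` record, its field names and the `traversal` function are hypothetical; they only illustrate how an identifier built from the edge sentences allows overlapping zones to coexist and be ordered either by weight or by position in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    first_sentence: int  # identifier of the opening edge sentence
    last_sentence: int   # identifier of the closing edge sentence
    theme: str           # focused theme attached to the zone
    weight: float        # assumed aggregate score for the zone

    @property
    def zone_id(self):
        # The pair of edge sentences identifies the zone, so two
        # overlapping zones still receive distinct identifiers.
        return (self.first_sentence, self.last_sentence)

def traversal(zones, theme, by_weight=False):
    """Return the zones for one focused theme in the chosen display order."""
    selected = [z for z in zones if z.theme == theme]
    if by_weight:
        return sorted(selected, key=lambda z: z.weight, reverse=True)
    return sorted(selected, key=lambda z: z.zone_id)

zones = [
    Zone(40, 52, "selskap", 3.5),
    Zone(10, 25, "selskap", 1.0),
    Zone(20, 30, "stat", 2.0),
    Zone(48, 60, "selskap", 7.0),  # overlaps the zone (40, 52)
]
print([z.zone_id for z in traversal(zones, "selskap")])
print([z.zone_id for z in traversal(zones, "selskap", by_weight=True)])
```

Navigating to the next zone with the same focused theme then amounts to stepping through the returned list.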

TABLE 4
Zone Themes

Thematic Associations                         | Discourse Element Indicator
<is associated with> stat OR statlig selskap  | problem
. . .                                         | evaluation (high frequency of adjectives in comparative form within the zone)
. . .                                         | solution
. . .                                         |

A zone with several repetitions of the same noun (or phrases/paraphrases) over, say, 20 sentences (not necessarily so that the noun occurs in each sentence) will be assigned a higher weight if the noun also occurs with the syntactical function ‘Subject’ (tag produced by CG-taggers). The link sets are based on the known text linguistic theory that bundles of the same word (or paraphrases) over a few sentences indicate a thematic zone. Increased precision may be obtained with words signalling discourse elements, for example cue phrases for problem, solution, evaluation, etc. The idea is simply: If a set of nouns co-occurs with one or several cue phrases signalling a problem (indicating utterances about problems), this will evoke another interpretation than if the noun set co-occurs with cue phrases signalling a ‘solution’. It is important to stress that the present invention uses the concept of indicators. Minor variations in language use, sincerity conditions, illocutionary force, etc. make it impossible to state semantic relations in absolute terms.
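
The weighting idea can be sketched in Python under stated assumptions: each occurrence in a zone is given as a (lemma, syntactic function) pair, the tag name `SUBJ` stands in for the tagger's subject tag, and the additive bonus `subject_bonus` is a hypothetical parameter. The text only says that the weight is higher when the noun occurs as subject, not how much higher.

```python
def zone_weight(occurrences, noun, subject_bonus=1.0):
    """Count repetitions of `noun` in a zone and add an assumed bonus
    for each occurrence that carries the subject function."""
    base = sum(1 for lemma, _ in occurrences if lemma == noun)
    as_subject = sum(
        1 for lemma, func in occurrences if lemma == noun and func == "SUBJ"
    )
    return base + subject_bonus * as_subject

# Assumed (lemma, syntactic function) pairs extracted from one zone.
zone = [("selskap", "SUBJ"), ("selskap", "OBJ"), ("stat", "SUBJ"), ("selskap", "SUBJ")]
print(zone_weight(zone, "selskap"))
print(zone_weight(zone, "selskap", subject_bonus=0.0))
```

Setting the bonus to zero reduces the weight to a plain repetition count, which makes the contribution of the syntactic function easy to inspect.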

The combined collocations reveal how often instances conforming to a grammar pattern occur in the text. Since the invention is based on grammar patterns, it is possible to discern (capture) nuances of utterances (text is seen as mediating communication between actors in a social context).

Thus the pattern (adjective comparative <followed by> noun) (with options for regulating the distance as known in GREP searches) may give a result such as ‘better company’, which may indicate an evaluation. The pattern (adjective positive <followed by> noun) resulting in ‘good company’ might in turn indicate something more like a viewpoint (evaluation completed). A set of grammar based request patterns will be able to capture several types of wordings in evaluative utterances. The pattern (adverb <f-b> determiner <f-b> adjective <f-b> noun) may give the result ‘not a good company’. In many systems words like ‘not’ and ‘a’ would have been eliminated by the stop list, although ‘not’ indeed changes the ‘meaning’ of the utterance. The combination of grammar patterns and collocations allows for a flexible extraction of information applied in the construction of attention structures.
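
A grammar based request pattern of this kind can be sketched in Python as a sequence of tags matched against tagged tokens. The tag names, the input format and the function `match_pattern` are assumptions for illustration, and the sketch matches strictly adjacent tokens only; the option for regulating the distance is left out.

```python
def match_pattern(tagged, pattern):
    """Return every contiguous run of tokens whose tags equal `pattern`.
    `tagged` is a list of (word, tag) pairs; `pattern` is a list of tags."""
    n = len(pattern)
    hits = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag == want for (_, tag), want in zip(window, pattern)):
            hits.append(" ".join(word for word, _ in window))
    return hits

# Hand-tagged sentence standing in for tagger output.
sentence = [
    ("This", "PRON"), ("is", "VERB"), ("not", "ADV"),
    ("a", "DET"), ("good", "ADJ"), ("company", "NOUN"),
]
print(match_pattern(sentence, ["ADV", "DET", "ADJ", "NOUN"]))
print(match_pattern(sentence, ["ADJ", "NOUN"]))
```

Because no stop list is applied before matching, ‘not’ and ‘a’ remain available to the longer pattern.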

Word Cue Phrase

Cue phrases are words or phrases that signal information about the discourse structure. A separate part of the classification scheme describes classes of cue phrases with reference to their role as indicators for discourse elements, and a set of criteria prescribing guidelines for the assignment of hypertextual links between text zones annotated with discourse indicators. The underlying assumption about discourse structures encompassing main lines such as Situation-Problem-Solution-Evaluation is widely known within the field of text linguistics.

Generally, cue phrases are ambiguous and highly context-dependent. The present invention focuses on a rather small set of cue phrases (nouns and verbs) that may be said to send inscribed signals for the superordinate argumentative function in a text span. For instance, the noun ‘conflict’ is regarded as an explicit signal of some sort of problem being discussed. The specific type of signal used varies across genres, e.g., a word like ‘crisis’ is frequent in news reports, but when used in a governmental report it may be considered an extraordinary signal and classified as an important word. Implicit signals for problems or solutions must be identified and interpreted by humans. The devices applied should preferably make it possible for a reader (user) to assign codes for discourse structure, and a set of code proposals related to text genre may support the user in this task. This calls for a device serving user-added codes and may be seen as a preferred facility connected to options for user-added text. The device for target word selection supports the identification of explicitly inscribed cue phrases.

Identified text zones (defined by zone link sets capturing certain lexical and grammatical information associated with pairs of sentences) can, in the present embodiment of the invention, be intersected with cue phrases (words or word constellations) that signal or indicate discourse elements. This covers one of the elements in the notion that text reflects how the author's focus of attention is moving. Cue phrases indicating how the discourse evolves are considered essential for the classification of text zones with respect to the notion of discourse elements. Cue phrases that indicate discourse elements are words and word constellations that signal description of situations, background information, negative evaluations indicating problems, explicit signals of problems, solution proposals, solution comparison, solutions selected (decisions), evaluation of solutions, and so on. Some of these indicators are directly inscribed in text; others are subtler and need in-depth analysis in order to be captured, provided users are willing to accept the costs involved. It is important to note that other types of combined information add to the cue phrases in the process of identifying or indicating discourse elements. These other types of information include verb tense, broad semantic classes for verbs, broad semantic classes for nouns (e.g., juridical persons, physical entities, events, etc.) and the nouns' grammatical form, characteristic use of adjectives indicating evaluations, and so on. The present invention separates devices capturing the different information types, thus making it possible for the user, or preferably pre-defined search macros, to combine contacts displayed in the text sounding board that refer to the information types captured from the underlying texts.

Organised sets of cue phrases have to be constructed for the purpose, by means of computer-supported procedures with manual intervention that apply ‘link cycles’ in a Target Word Selection Procedure (TWS). The procedure starts by extracting known cue phrases and synonyms (or related terms) from preferably general thesauri. Any language includes words (nouns, adjectives, verbs, adverbs) referring to, or immediately evoking associations to, for instance, problems. Some simple examples may illustrate the point: a noun set such as {pollution, disturbance, crisis, crash on exchange, trouble, etc.}, and similarly an adjective set such as {declining, insecure, failing, difficult, laborious}. Based on a start set chosen from a general thesaurus by following links of term relationships, the TWS procedure's first cycle identifies and locates matching word occurrences in the text and identifies content words in the adjacent sentences (preceding and subsequent, with options for regulating distance). The application designer examines the results from the TWS procedure's first cycle and marks words or phrases of interest that may have a discriminating effect for the identification of discourse elements. The newly detected cue phrases from the inner context in the neighbourhood of matching words from the first TWS cycle are transmitted to the second TWS cycle. Likewise, newly detected cue phrases are again transmitted into the third TWS cycle, and so on. At present, attempts to locate word lists organised according to the principle of discourse elements have been unsuccessful. If located, such available word lists may preferably provide the initial target word set transmitted into the first TWS cycle. However, any available word list must be tuned according to the prevailing genre in the current document collection selected by the user for review/exploration. It is well known from sociolinguistics that style and vocabulary differ between professions, level of authority, level of norms (legacy or other social norms), level of competence, and so on.
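By way of illustration only, the link cycles of the TWS procedure may be sketched as follows in Python. The names `tws_cycle`, `run_tws`, `approve` and `STOP_WORDS` are hypothetical, the tokenisation is deliberately naive, and the choice to carry the accumulated target set into each new cycle is an assumption; the `approve` callback stands in for the application designer's manual marking of words of interest.

```python
import re

# Assumed, minimal stop-word list standing in for a real content-word filter.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "was", "by"}


def tws_cycle(sentences, targets, distance=1):
    """One cycle: find sentences containing a target word and collect the
    words of the sentences within `distance` before and after them."""
    tokenised = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    candidates = set()
    for i, words in enumerate(tokenised):
        if words & targets:
            lo = max(0, i - distance)
            hi = min(len(tokenised), i + distance + 1)
            for j in range(lo, hi):
                if j != i:
                    candidates |= tokenised[j]
    return candidates - targets - STOP_WORDS


def run_tws(sentences, start_set, approve, cycles=3, distance=1):
    """Repeat cycles; words approved in one cycle are fed into the next."""
    targets = set(start_set)
    for _ in range(cycles):
        new = {w for w in tws_cycle(sentences, targets, distance) if approve(w)}
        if not new:
            break  # nothing new was marked, so further cycles add nothing
        targets |= new
    return targets
```

In this sketch the `distance` parameter plays the role of the option for regulating how far from a matching sentence the procedure looks.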

Criteria Pragmatic Author Focused Information

Author-focused information is information emphasised by an author of the document. The generation of word lists comprising focused words may start by including nouns and noun phrases specifically signalled by the author in, for example, titles, headers at all levels, sections with particular lexical signals, particular chapters, or other lexical signals by which the author presents important points of the document. The device for generation of particular word lists includes series of criteria, which may be characterised as pragmatic, i.e. features related to the documents' situational context, and features related to the user community.

Word Focus

The concept ‘focused word’ or author-focused word refers to one type of textual signature, i.e. the set of words registered in the set ‘Word Focus’ shares some characteristic qualities reflecting aspects of the author's attention when writing the text. In particular, the devices grouped as mostly related to the ‘Apparatus Zonation’ operate on the DBP Word Information and derive sets of author-focused words in each intratextual context. Each set is further consolidated to cover intertextual contexts, i.e. the same word type occurs in several texts extracted from a document collection in which the documents share some features referring to the situational context.

In the first round the relative frequency of the main grammatical word classes is calculated, and word types occurring above a certain threshold value (determined intratextually, for example from 0.03 to 0.05) are stored in a temporary file. In the next round this set of words is intersected with the combined collocations in order to identify how and to what extent the words in the temporary file are regularly modified by other words also registered in the temporary file. The detection of word constellations that regularly co-occur causes a weight score to be assigned to the word types constituting the regular word constellations. The regularity is assumed to signal one aspect of the author's attention. The Word Weight value will preferably also be increased if the device for word classification conveys that the word is to be considered an important word, for example in cases where the word matches entries in a list of user-focused words or other existing word lists, or words/terms encoded in domain specific thesauri.

Additionally, the word weight can be increased if the word re-occurs in sentences classified as important sentences. Another option is to assign an added score if the word frequently appears in sentences where the main verb is in the present tense. These sentences may indicate a higher significance than other sentences in that some of the author's opinions will be uttered in the present tense, i.e. referring to the logical now in the author's situational context.
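The two rounds described above may, purely as a sketch, be expressed in Python as follows. The function name `focused_words`, the use of adjacent word pairs as a stand-in for collocations, the `min_pairs` regularity cut-off and the unit increments are all assumptions made for illustration; the input is assumed to be sentences already reduced to content words.

```python
from collections import Counter


def focused_words(sentences, threshold=0.03, important=frozenset(), min_pairs=2):
    """sentences: list of lists of content words (already tagged and filtered).
    Returns a weight per word type whose relative frequency exceeds `threshold`."""
    tokens = [w for s in sentences for w in s]
    freq = Counter(tokens)
    total = len(tokens)

    # Round 1: keep word types above the relative-frequency threshold.
    frequent = {w for w, n in freq.items() if n / total > threshold}
    weights = {w: 0.0 for w in frequent}

    # Round 2: count adjacent pairs in which both words survived round 1.
    pair_counts = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            if a in frequent and b in frequent:
                pair_counts[(a, b)] += 1

    # Regularly co-occurring pairs raise the weight of both word types.
    for (a, b), n in pair_counts.items():
        if n >= min_pairs:
            weights[a] += 1.0
            weights[b] += 1.0

    # Words found in an existing list of important words get an added score.
    for w in frequent & set(important):
        weights[w] += 1.0
    return weights
```

Further increments, such as the added score for words recurring in important sentences or in present-tense sentences, could be added in the same manner.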

Author Focused Words and Redundancy

Redundancy in the form of the same nouns highlighted in all the zones in which they occur can affect the users' perception of ‘newness’ or the ability to distinguish the moves in the author's focus. The present invention gets around redundancy by providing predefined focus chains and highlighting ‘new words’ (nouns except those forming focus chains) within each zone intersected by the focus chain. The words in focus chains are displayed in specially designed exploration panes (text sounding boards) from which the user can select preferred chains for text traversal.

In order for the user to select a text (from a document) from a set of texts (from a document collection), the present invention informs the user how the author of a text deals with a theme by providing several options for navigating through the text zones. The user traverses text zones and can preferably at any point give instructions for explorative directions (pass all zones hastily and/or consider only zones with certain word occurrences). At each cursor point, the user may preferably shift from one navigational option to another.

In the text sounding board, the author-focused information is visualised in panes making it possible for the user to apprehend or conjure up a mental image of themes in the text. The pane content may be restricted to show only words within zones with strong cohesive link sets, words within zones that contain certain words (for instance as specified by the user), only zones that are bonded, or all zones. It is also possible to show word collocations from all sentences (in all cases the pane content is restricted to content words with specific syntactic functions within sentences). One particular pane is denoted as the ‘triple track’. From this pane, the user can select (activate) one or several words in one of the tracks, and the zone traversal path is accordingly restricted to those zones containing the user requested information. After selection, the panes will inform the user of the number of zones included in the traversal path. When an author-focused word in for example the left-most track is activated first, with the word in the first track, and in accordance with words in the roles of subject, verb and object. The apparatus underlying the triple track gives a grammar based attention structure that makes it possible to comprehend features of the selected words' nearest inner context. By moving up, down and sideways in the triple track, the user gets a theme view of the texts. At all times, the text displayed in the text pane will be adjusted in line with the words that the user selects/activates from the triple track. A special device is designed for the automatic alignment between the content of the triple track and the content of the text pane. Preferably the direction of alignment can be switched, in that when a user moves the cursor downwards or upwards in the text pane, the triple track captures and reflects information stored in the link sets for the text zones passed over.

User Focused Information in the Vicinity

In addition to the apparatus for identifying zones in the texts, there is an apparatus for presenting user-focused information in a text pane interconnected with the content displayed in the text sounding board, in which the user is given flexible options supporting text driven exploration and navigation. The Top Layer in MAFS contains files in XML-format, and the present invention applies well-known techniques for visualisation in accordance with XML-tools known in the prior art. User-focused information is transmitted via the text sounding board, that is, when the user is exposed to the content in the various panes in the text sounding board, she may select/activate parts of the content. The content selected/activated at a point in time is denoted as the ‘current user-focused information’. The user-focused information, i.e., a set of particular words or particular words in combination, is highlighted in the interconnected text pane. The user-focused information may be highlighted in the whole text or the highlighting may be restricted to appear within the zones that enclose all or parts of the user-focused information (the zones themselves being highlighted in grey tones). The user-focused information is thus displayed in its inner context (the text from which the content presented in the text sounding board is derived).

The present invention supports options for the display of words in the vicinity of user-focused information, and the present embodiment of the invention supports flexible options that can be tailored to meet the needs of particular user communities. For example, if a user has selected and activated a distinct noun or sets of nouns displayed in the text sounding board (nouns referring to one specific actor (Statoil) or group of central actors, e.g., oil companies), these nouns are highlighted in the text pane. As mentioned above, the user can restrict the visualisation device to only highlight nouns that occur within the zones where the particular nouns are included in the zones' link set. In the case of nouns in this example, the user can request the display of immediately subsequent verbs for each occurrence of the noun or group of nouns in question. The set of subsequent verbs can further be restricted according to tense and/or according to broad semantic classes (see Word List Semantic Class Verb). Such broad semantic classes include verbs that commonly and explicitly refer to certain types of utterances or acts. Thus, the user can determine details with respect to the plain attention structures, in conformity with the underlying grammatical information transformed into a specialised XML-format stored in the Top Layer of MAFS. Preferably the users are supplied with a wide assortment of pre-defined and preferred search and display macros, because it cannot be expected that users master the grammatical information concealed in the underlying file structures.
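A display macro of the kind just mentioned, returning the first verb following each occurrence of a selected noun and optionally restricted by tense, may be sketched as follows. The function name `subsequent_verbs`, the tuple layout `(word, word_class, tense)` and the tag values `"N"` and `"V"` are hypothetical and do not reflect the actual XML-format held in the Top Layer of MAFS.

```python
def subsequent_verbs(tagged_sentence, nouns, tense=None):
    """tagged_sentence: list of (word, word_class, tense) tuples.
    For each occurrence of a selected noun, report the first verb after it,
    keeping the pair only if the verb satisfies the optional tense filter."""
    hits = []
    for i, (word, word_class, _) in enumerate(tagged_sentence):
        if word_class == "N" and word in nouns:
            for next_word, next_class, next_tense in tagged_sentence[i + 1:]:
                if next_class == "V":
                    if tense is None or next_tense == tense:
                        hits.append((word, next_word))
                    break  # only the first verb after the noun is considered
    return hits
```

A restriction by broad semantic verb class could be added as a further optional filter alongside `tense`.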

Criteria Pragmatic Sentence Type

Successive sentences may be evaluated as being of the same type, for instance, sentences or sub-sentences that are list elements. In other cases a feature may be that sentences do not contain any subjects, verbs, etc.

It is expected that most documents will be produced according to an XML-schema known in the prior art. This will relieve the design work, and these issues are therefore not dealt with in any detail. However, a special detector locates sentences without verbs, as is often the case for lists. If the texts are well formed in conformance with XML, the device that calculates zone borders can easily be adjusted to cover such more typographical features.

Sentence Class

The Sentence Class information may be used when identifying and representing superordinate semantic structures in texts, i.e., discourse structures.

For instance, if the sentence contains words classified as ‘Actor Important’ or words or phrases classified as indicating discourse elements, this attribute type in the DBP Information Sentence can be assigned a code (or a set of codes) representing such classes of superordinate argumentative function or discourse elements. The set of code types is inscribed in a classification scheme. The codes are further utilised in the zonation process, specifically the device that calculates zone weights. The zone weights are further transmitted to a device that generates zone traversal paths, see ‘Device Zone Bond Generation’. The criteria based on sentence class influence the ranking of zones to reflect one aspect of ‘importance’ (pragmatic weight).

Word Position Relative

A particular device subordinated to the device for zone bond generation utilises information about the words' relative position within sentences, information that is also preserved in a matrix consolidated into the zone link set. The device also takes into account the words' grammatical class and, for nouns, also their syntactical function. The calculation of similarity is a simple vector comparison. However, if the sentences are marked with, or alternatively are within zones marked with, a discourse element indicator, otherwise similar sentences can be flagged as different in the display.

The following elements constitute parts of the information in the link sets, with the words listed in order of appearance in the sentences. Sentences marked as 1 and 2 share 4 noun elements, and the 4 noun elements are in the same order. The 4 nouns have almost the same weight (government <is a> important actor, petroleum fond <is a> focused word, Norwegian <is a> important word (determines location), welfare state <is a> focused word). Shared subsets of words with almost the same weights assigned indicate thematic relations. However, sentence number 2 also includes a clear sign classified as ‘Problem indicator’, and sentence number 1 correspondingly includes a clear sign of an expressed opinion (ascertain). A clear sign differentiates two sentences that share another feature: four identical words in the same order. If such discourse signs are differentiated, it will accordingly be possible to distinguish between otherwise like sentence surrogates (i.e., representative extracts from a sentence), and this information can furthermore be signalled in the text sounding board. Sentences 3 and 4 show an example from two political party programs of the same political party, from 1993 and 1997. Long distance comparisons, across texts with a considerable time span in between, can yield content transmitted to the triple track that may be of interest to a user engaged in an in-depth comparative text exploration. The two texts in this example share some features related to the situational context: they are both extracted from political party programs and relate to the same political party, and the sentences surely indicate a difference in opinion. In the latter two cases the noun ‘government’ is also encoded as subject.

-   S1 {government, ascertain, petroleum fond, secure, Norwegian, welfare state}
-   S2 {government, petroleum fond, Norwegian, welfare state, danger}
-   S3 {government, await, gas power plant}
-   S4 {government, support, gas power plant}
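The comparison of sentence surrogates S1 to S4 may be sketched as follows. This is an illustrative simplification, not the device itself: the function names `shared_in_order` and `compare`, and the `signs` mapping from words to discourse sign classes (with the label ‘Opinion’ chosen here for the expressed-opinion sign), are hypothetical, and word weights are left out.

```python
def shared_in_order(a, b):
    """Return the elements shared by two surrogates (in the order of `a`)
    and whether they appear in the same relative order in both."""
    shared_a = [w for w in a if w in b]
    shared_b = [w for w in b if w in a]
    return shared_a, shared_a == shared_b


def compare(a, b, signs):
    """signs: assumed mapping from a word to its discourse sign class."""
    shared, same_order = shared_in_order(a, b)
    signs_a = {signs[w] for w in a if w in signs}
    signs_b = {signs[w] for w in b if w in signs}
    return {
        "shared": shared,
        "same_order": same_order,
        "differs_by_sign": signs_a != signs_b,
    }


S1 = ["government", "ascertain", "petroleum fond", "secure", "Norwegian", "welfare state"]
S2 = ["government", "petroleum fond", "Norwegian", "welfare state", "danger"]
SIGNS = {"ascertain": "Opinion", "danger": "Problem indicator"}
result = compare(S1, S2, SIGNS)
```

Here `result` reports four shared elements in the same order while flagging that the two surrogates carry different discourse signs, so that they can be displayed as different despite the shared words.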

The present invention relates to a method and system for the construction of attention structures that lead the users' attention to certain parts of the texts. No attempt is made to calculate a ‘similarity of meaning’. If 10 or 100 different sentences contain exactly the same set of nouns, and even in the same order, the present invention does not suggest that the sentences have the ‘same meaning’. Minor differences, for instance related to adverbs, adjectives in superlative form or more advanced variations related to rhetoric phenomena, may give rise to dissimilarities in ‘meaning’. Degree of similarity based on the notion of ‘content as a string of words’ may indicate that items (sentences or larger text spans) are thematically related.

Device Pragmatic User Profile Spin-Off

The User Profile Spin-Off is simply a set of words or phrases based on a processing of the User Profile aiming at providing criteria applicable in, for instance, the device for constructing lists of user-focused words. Subsequently, lists of user-focused words can be transmitted into the set of grammar based request patterns, in which some of the grammar based operands iteratively are substituted with user-focused words.

In this setting, the grammar based request patterns are made more specific. The particular device selects the grammatical search macros iteratively and, for each of them, also iteratively, replaces fixed operands with operands referring to the content of the User Profile. The fixed operands thus change their role to be ‘open operands’, which are successively filled in with words known to be of interest to a particular user or a group of users. In the device mentioned, the open operands will preferably be automatically adjusted, for example by iteratively increasing a distance operator allowing 2 or more ‘not specified words’ in between the sequence of words extracted from the User Profile.
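The iterative widening of a distance operator may be sketched with ordinary regular expressions as follows. The functions `gap_pattern` and `first_match` are hypothetical names, and plain word sequences are used in place of grammar based operands, which is a simplifying assumption.

```python
import re


def gap_pattern(words, max_gap):
    """Build a pattern matching `words` in sequence, allowing up to
    `max_gap` unspecified words between consecutive words."""
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    body = gap.join(re.escape(w) for w in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def first_match(text, words, max_gap_limit=4):
    """Increase the allowed distance step by step until the sequence matches.
    Returns (distance used, matched text) or None if no distance suffices."""
    for max_gap in range(max_gap_limit + 1):
        match = gap_pattern(words, max_gap).search(text)
        if match:
            return max_gap, match.group(0)
    return None
```

For the text "The government will secure the petroleum fund." and the words `["government", "fund"]`, the sequence first matches when four unspecified words are allowed in between.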

The particular device requires that the User Profile be pre-processed by a grammar tagger, at minimum with POS-tags covering the four major grammatical word classes. The set of grammar based search macros can accordingly operate on the content of the User Profile and preferably register bundles of frequent pattern occurrences. The User Profile may include stretches of texts that the user previously has marked out as noteworthy. In a similar way as mentioned above, the operands within the pattern occurrences identified can be replaced by open operands, thus functioning as an iteratively changed search expression operating on the texts that the user finds of relevance to explore. The pattern occurrences are iteratively transformed to changing generalised patterns in that the open operand can be ‘moved’ one position at a time within the pattern.

This particular device can, presupposing the existence of a User Profile, relieve the user of many tasks related to text exploration and navigation. The applications of the present invention can adapt to the User Profiles and, for example, only generate zones and contacts displayed in the text sounding board that in some detail match information held in the User Profile. The system can thus automatically deliver attention structures which, to some degree, reflect information that the user previously has experienced as convenient in a text exploration task.

Apparatus for Filtering

The previously described apparatuses in the present invention (acquisition, segmentation, disambiguation, zonation) generate the information stored and managed in the present invention's database partitions, which are managed by a DBMS and IRMS. The integrated set of database partitions constitutes the present invention's selectivity.

The selectivity of the present invention incorporates and supports:

-   grammatical information derived from CG-taggers
-   semantic information and the transfer of techniques related to thesauri construction
-   pragmatic information related to text understanding and features related to the situational context
-   statistical information derived from applying a reference corpus and computing keyness, and keyness of keyness
-   frequency information combined with grammatical information in relation to interconnected documental logical object types
-   zonation and filtering realised as intersecting chains, which embody the various types of information outlined above

The present invention aims to enhance the exploration of text through various filtering options that operate on the database partitions. Text exploration involves a variety of tasks ranging from problem definition to the assessment of the extent to which the results presented in the preferred interface are useful with respect to the particular interpretative task the user is engaged in. The present invention intends to support the user in these tasks. Basically, the approach is based on a predefined set of filtering options specifying rules for reducing and extracting information from the database partitions, as illustrated in FIG. 10. The filtering options organise the contacts (arranged either in simple panes and/or in the most advanced option, which is embodied in the triple track) and accommodate these to various types of moves performed by the user operating on the text sounding board. The term move refers to the users' actions (selections and activations of displayed contacts).

Traditionally, a distinction is made between three kinds of moves or selections based on how the user evaluates the search result:

-   there are too many text units or contacts in the result (futility point exceeded),
-   there are too few text units or contacts in the result,
-   or the text units or contacts are considered as not relevant or off-target

The anticipated user moves are accordingly that the user will try to apply filtering options in order to:

-   reduce the retrieved set of text units or contacts (with the intention to increase precision),
-   increase the retrieved set of text units or contacts (with the intention to increase recall).

However, the selection of a new constellation of contacts or the activation of a filter or predefined search macro will not necessarily reduce the retrieved set. Instead a completely new set is retrieved. The underlying search selectivity determines in what ways it is possible to construct search expressions aiming at satisfying the user's search intentions.

The present invention's preferred selectivity rests on the principles within the free-faceted classification scheme, which prescribes guidelines for the organisation of the set of chains that are generated in the apparatus for zonation. The filtering options are, in principle, predefined search macros that are structured in layers, from general to specific, where the most specific involve iterative intersections between chains. The detailed description below explains the organisation of search macros in layers, in which the lowest level is denoted as building blocks, the next level comprises constellations of such building blocks (denoted as functions), and the level above comprises groups of functions, denoted as constructions.

When a user perceives her retrieved set as too large or too restricted, the underlying selectivity makes it feasible to construct a system that depicts useful search directions by providing an orderly set of predefined search macros. That is, the links between search macros and the displayed information shall tell the user whether a superior or subordinate macro will increase or decrease the retrieved set. In the preferred embodiment of the present invention, the names assigned to the macros, together with short explanations, will provide information about the available set of filtering options. The exact realisation of these facilities is dependent on the HCI perspective adopted.

Classification Scheme

The classification scheme is a tool used when the propositional content of a text unit (sentences) is analysed, reduced and represented in a chain as explained in section ‘Apparatus for zonation’. The classification scheme preferably applied in the present invention is further elaborated in Aarskog (1999). The principles underlying the free faceted classification give guidelines for the organisation of the various chains generated with reference to the set of zonation criteria. The chains are classified as members of one or more facets as specified in a classification scheme. In that each chain or subset of a chain can be a member of more than one facet, the present invention embodies a structure allowing the same word type to appear in different semantic roles. For example, the contact ‘oil company’ may be a member of a facet for ‘organisation’ and may at the same time be a member of a facet ‘oil affairs’. The classification scheme also determines the association types between categories and facets, in which each member in a facet may embed another facet. A theme frame is a constellation of contacts, i.e. subsets of chains, organised as interlinked facets. The filtering options can capture these facets and intersect them with the information that is stored in the database partitions (DBP Information Zone Link Set, DBP Information Word, DBP Information Chain, etc). The present invention's arrangement of subsets of chains according to the rules or guidelines in the classification scheme supports the notion of data-independence. That is, it is possible to change the facets, or introduce new facets, or relate sets of facets to the interests of particular users, without subsequent changes in the filtering options. The main structure in this classification scheme is simple: it consists of five categories. Each category is further divided into facets, which in turn may be divided into more detailed facets. This evolving structure is based on free faceted classification principles in which the final set of facets reflects the classification performed. The subsets of chains assigned to each facet, and how these subsets are arranged within a facet, gradually determine whether a facet should be divided into subordinate facets. The simple structure and the guidelines for use reflect an important perspective on the content representation: it is possible to construct very general theme frames and also theme frames with high specificity. The decision on level of generality-specificity will be based on what a certain user community perceives as relevant to include in the system's selectivity.
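The membership of one contact in several facets, and the data-independence of the filtering options, may be illustrated by the following sketch. The dictionary layout, the facet contents beyond the ‘oil company’ example, and the function names `facets_of` and `zones_for_facet` are assumptions made for illustration; they do not reproduce the actual database partitions.

```python
# Assumed facet table: facet name -> set of contacts assigned to it.
FACETS = {
    "organisation": {"oil company", "government"},
    "oil affairs": {"oil company", "gas power plant"},
}


def facets_of(contact, facets):
    """List every facet in which a contact is a member."""
    return sorted(name for name, members in facets.items() if contact in members)


def zones_for_facet(facet_name, facets, zone_link_sets):
    """Intersect a facet with zone link sets: keep the zones whose link set
    contains at least one member of the facet."""
    members = facets[facet_name]
    return [zone_id for zone_id, link_set in zone_link_sets.items()
            if members & link_set]
```

Because `zones_for_facet` receives the facet table as data, facets can be changed or added without altering the filtering function, which is the sense of data-independence intended in this sketch.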

Modus Operandi

The present invention embodies sets of devices that operate on the database partitions in order to construct attention structures, which are organised in various ‘modus operandi’. The design of the modus operandi is inspired by theory from ancient rhetoric, specifically Cicero's ‘De Oratore I.xxxi’, and each modus preferably supports the activities known as Inventio, Dispositio, Elocutio, Memoria and Actio. The design model based on ancient rhetoric is further elaborated in Aarskog (1999). Predefined sets of filtering options will preferably be arranged in different levels of ‘modus operandi’, and the preferred embodiment of the present invention includes five partitions that each preferably will be put in conformity with modern HCI known in the prior art. The various ‘modus operandi’ are seen as a conceptual model that guides the structuring of the interconnections between the wide set of filtering options. Each level gives access to certain predefined compositions of basic building blocks (any kind of composition must be based on a preceding decomposition). The set of filtering options operates on the series of interlinked database partitions described in previous sections, as well as the Top Layer of MAFS, which keeps record of all the intermediate files generated dynamically during the filtering. Basically the filtering options comprise a set of iteratively applied reduction functions (each composed of a rather small set of basic building blocks, or search patterns with ‘open operands’). This design preferably allows for parallel processing. The reduction functions transmit the results in intermediate files that are captured and composed by another set of functions. Finally the reduced and composed result is transmitted to a device that sorts, ranks and styles it in accordance with HCI guidelines and displays the product in the text sounding board.

The filtering options in the present invention embody a particular design in which recursive applications of reduction functions are ‘grouped’ together in higher order constructions. These constructions are in turn divided into five main ‘modus operandi’. These ‘modus operandi’ manifest the complexity of the attention structures that are generated during the users' interaction via the text sounding board. The middle level constructions constitute ordered sets of filtering options in which group number 2 is more complex than group number 1, and where number 2 compensates for disadvantages or failure of discrimination or diversification encountered via group number 1. The term ‘disadvantage’ in this setting means that the displayed result in the text pane and in the panes of the text sounding board does not satisfy the users' intention. The particular design, involving modus operandi enclosing groups of constructions, which enclose functions, which enclose basic building blocks, allows for a highly flexible and yet efficient filtering. The building blocks constitute search operators known in the prior art, and a set of aggregated search operands, which in the present invention refers to the documental logical object types (DLOT) and the set of attribute types attached to each type of DLOT. At the generic level the set of attribute types is denoted ATOT, the set defined for text is TATOT, the set defined for zones is ZATOT, the set defined for sentences is SATOT, and the set defined for words is WATOT. Each set is realized in separate interconnected database partitions according to the notion that each DLOT encloses another DLOT, and in which zones are considered to be derived documental logical object types.
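The principle of enclosure (constructions enclosing functions, which enclose basic building blocks) may be sketched as function composition. All names below (`has_word`, `min_weight`, `make_function`, `make_construction`) and the dictionary representation of a zone are hypothetical; real building blocks would operate on the DLOT attribute sets rather than on in-memory dictionaries.

```python
from functools import reduce


# Basic building blocks: predicates over a single zone (assumed dict layout).
def has_word(word):
    return lambda zone: word in zone["words"]


def min_weight(threshold):
    return lambda zone: zone["weight"] >= threshold


def make_function(*blocks):
    """A reduction function keeps the zones satisfying all its building blocks."""
    return lambda zones: [z for z in zones if all(block(z) for block in blocks)]


def make_construction(*functions):
    """A construction applies its reduction functions one after another,
    each operating on the intermediate result of the previous one."""
    return lambda zones: reduce(lambda acc, fn: fn(acc), functions, zones)


zones = [
    {"id": 1, "words": {"oil", "tax"}, "weight": 0.8},
    {"id": 2, "words": {"oil"}, "weight": 0.2},
    {"id": 3, "words": {"tax"}, "weight": 0.9},
]
construction = make_construction(
    make_function(has_word("oil")),
    make_function(min_weight(0.5)),
)
selected_ids = [z["id"] for z in construction(zones)]
```

A modus operandi could then be modelled as a named group of such constructions, ordered from simple to complex.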

The principle of enclosure guiding the design in the preferred embodiment of the invention provides for a system model and implementation that may be conceived as an ‘ever expanding matrix’. That is, the specification is based on the principle of rounds and levels in the free-faceted classification scheme. A design following the guidelines of a free-faceted scheme allows for parallel processing, which is preferred in the present invention.

The set of Modus Operandi is considered as ‘conceptual models’ for what is to be included (or displayed at one point in time) in the present invention's interface, preferably denoted as the Text Sounding Board, which is interlinked with a Text Pane. The adaptation of HCI factors will preferably make the interface comfortable, in that it is considered important that not too many options are available at ‘all times’. HCI factors will preferably guide the transition from one modus operandi to another, more advanced modus. With respect to the users' need for depth in their exploration and navigation, the interface should preferably display a minimum set of necessary ‘buttons’ attached to the text sounding board partitions.

The present embodiment of the invention comprises 5 Modus Operandi, given the preliminary names: Plain Modus Operandi, Crafted Modus Operandi (qualified), Quizzed Modus Operandi (multi-qualified), Commix Modus Operandi, Virtuous Modus Operandi.

The following section briefly describes the structure of enclosure involving both DLOTs and the main principle for the arrangement of filtering options underlying the various modus operandi.

Conceptualisation of Building Blocks in Search Macros

At the conceptual level, the filtering options may be conceived as a collection of building blocks that are interconnected in a multi-layered system of predefined search macros. The search macros are organised in networks following the same principles as in the construction of thesauri, which are the principles underlying the free faceted classification theory. The inter-linked search macros form an important component in the system of logical access points (system's selectivity) to text units. The idea behind the construction of a predefined set of search macros is to build a tool set for information filtering. The focus is on the use of grammatical information extracted from the output from CG-taggers that is transformed into codes embodied in APO-triplets (part-of Theme Frame). Nominal expressions are separated into two facets denoted as ‘Agent’ (nominal expressions with the grammatical function Subject within a sentence) and ‘Object’ (nominal expressions with the grammatical function Object within a sentence). A Theme Frame differs from other types of word lists in that words with certain grammatical functions are displayed (default option) in their order of appearance in the text. The main grammar pattern model is composed of two sets of regular expressions:

One set operates on two main search operand classes, the grammatical word classes of nouns and adjectives. These regular expressions give access to the text's ‘world-building elements’. The other set of regular expressions operates on the grammatical word classes of verbs and adverbs, giving the indicators of ‘function-advancing elements’ in the text. These two sets, together with other regular expressions operating on other word classes, provide a grammatically grounded selectivity. Combined in search macros made available in specially designed window panes (with all the functionality that follows), they let the user explore the underlying text and make moves that reduce or increase the search span.
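The two sets of regular expressions can be illustrated with a minimal sketch. It assumes a simplified tagged-text format in which every token is written as word/TAG; this format, the tag names and the function name select are assumptions made for illustration only and do not reproduce actual CG-tagger output.

```python
import re

# Hypothetical tagged sentence: each token is written as word/TAG.
# Real CG-tagger output is richer; this format is an assumption.
TAGGED = "The/DET new/ADJ law/NOUN clearly/ADV regulates/VERB gas/NOUN transport/NOUN"

# One set of expressions targets nouns and adjectives ('world-building elements').
WORLD_BUILDING = re.compile(r"(\S+)/(?:NOUN|ADJ)\b")
# The other set targets verbs and adverbs ('function-advancing elements').
FUNCTION_ADVANCING = re.compile(r"(\S+)/(?:VERB|ADV)\b")


def select(pattern, tagged_text):
    """Return the words whose tag matches the pattern, in order of appearance."""
    return pattern.findall(tagged_text)


world = select(WORLD_BUILDING, TAGGED)
function = select(FUNCTION_ADVANCING, TAGGED)
```

Because the words are returned in their order of appearance, the two lists could feed a display similar to a Theme Frame, although the sketch does not model APO-triplets.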

The overall structure at the highest abstraction level is outlined below:

-   Type: Grammar Pattern
    -   <gives rules for> cat5 fac0 Subject Matter
    -   <gives rules for> Type: Search Macro
    -   Type: Grammatical Information <is abstracted into>
    -   Type: Regular Expression <is part of>
-   Type: Grammatical Information
    -   <output from process> Process: Text Disambiguation
    -   <is derived from> Type: CG Tagger Output
    -   <is abstracted into> Type: Grammar Pattern
    -   <is assigned to> Type: DLOT Word
    -   Type: Grammatical Form <is a>
    -   Type: Grammatical Function (GF) <is a>
    -   Type: Grammatical Word Class (GWC) <is a>
-   Type: Grammatical Function (GF)
    -   <is a> Type: Grammatical Information
    -   Type: GF Object <is a>
    -   Type: GF Subject <is a>
    -   Type: GF Transitivity <is a>
    -   Type: GF Verb Tense & Modality <is a>
-   Type: Grammatical Word Class (GWC)
    -   <is a> Type: Grammatical Information
    -   Type: GWC Adjective <is a>
    -   Type: GWC Adverb <is a>
    -   Type: GWC Noun <is a>
    -   Type: GWC Verb <is a>
-   Type: GWC Noun
    -   <is input to> Type: Filter Noun
    -   <is a> Type: Grammatical Word Class (GWC)
    -   <is part of> Type: GWC Nominal Expression
    -   Type: GWC Noun Common <is a>
    -   Type: GWC Noun Proper <is a>
    -   Type: SVO Entry Noun <is subset of>
    -   Type: SWC Noun <refers to>

The main set of search operand types is outlined below, and the next section describes these in more detail with respect to how they are related to basic building blocks, functions and groups of functions.

-   Type: Search Operand
    -   <is input to> Type: Search Macro
    -   Type: Association Type <is a>
    -   Type: Attribute Type attached to DLOT (ATOT) <is a>
    -   Type: Category <is a>
    -   Type: Chain <is a>
    -   Type: Code <is a>
    -   Type: Code Family <is a>
    -   Type: Contact <is a>
    -   Type: Documental Logical Object Type (DLOT) <is a>
    -   Type: Dublin Core Element Set (DCE) <is a>
    -   Type: Facet <is a>
    -   Type: Free-Text Index Term <is a>
    -   Type: Frequency Information <is a>
    -   Type: Open Operand <is a>
    -   Type: Search Macro <is a>
-   Type: Documental Logical Object Type (DLOT)
    -   <is an object in> Type: Document in Collection
    -   <output from process> Type: Search Macro
    -   <is a> Type: Search Operand
    -   <is a> Type: Text Unit
    -   Type: DLOT Header <is a>
    -   Type: DLOT Identifier <is assigned to>
    -   Type: DLOT Paragraph <is a>
    -   Type: DLOT Sentence <is a>
    -   Type: DLOT Title <is a>
    -   Type: DLOT Token <is a>
    -   Type: DLOT Word <is a>
    -   Type: DLOT Zone <is a>
    -   Type: Theme Frame <refers to>
-   Type: Frequency Information
    -   <refers to> Type: DLOT
    -   <is a> Type: Search Operand
    -   Type: Frequency Chain Level <is a>
    -   Type: Frequency Document Level <is a>
    -   Type: Frequency Grammar Level <is a>
    -   Type: Frequency Sentence Level <is a>
    -   Type: Frequency Zone Level <is a>
    -   Type: Sentence Density <is a>
    -   Type: Sentence Weight <is a>
    -   Type: Zone Density <is a>
    -   Type: Zone Weight <is a>

The specification of search operands shows that search macros are also search operands (recursive). This means that an active search macro can at any time be combined with search operands referring to the content of the various types of categories and facets specified according to the rules given in the classification scheme. The category Agent is by default divided into facets for persons, organisations, social/work-related positions, and other types of subject matter divided into facets based on semantic criteria. These categories/facets can be activated as additional filters operating ‘on top’ of the grammar based search macros. The search macros and filters are further organised in layers, and interlinked in a semantic net.

The codes assigned to the categories/facets in the second layer are results of the target word selection procedure, but also include wordlists extracted from publicly available information (register of job titles, register of companies, etc.). These filters will of course have to be tuned according to what a certain user community may find interesting to make ‘more’ retrievable.

The search operands, including search macros, organised in networks, in fact represent a kind of ‘concept abstraction’. The degree of abstraction when these concepts are used as search operands will of course have an effect on retrieval results. A proper realisation of this structure should therefore include options for query modifications. A search macro represents a conjunction and/or disjunction of several search operands, each referencing a certain level in a concept hierarchy (index terms organised in abstraction levels). The user should be given options to select ‘moves’ for each of them separately, for instance by providing options for moving up one or several levels (query expansion aiming at higher recall) or down (query reduction aiming at higher precision). Each search operand is considered as an object with options for showing embedded codes (index entries) or embedding codes. When a user selects a replacement, this new index entry is the current search operand within the modified current search macro (and the modified search macro can be stored for later use).
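The ‘moves’ described above can be sketched as follows. The concept hierarchy, its terms, and the names SearchOperand, up, down and replace_operand are hypothetical and serve only to illustrate query expansion and query reduction on a single operand of a search macro.

```python
from dataclasses import dataclass

# Hypothetical concept hierarchy: narrower term -> broader term.
PARENT = {
    "oil company": "company",
    "company": "organisation",
    "organisation": "agent",
}


@dataclass
class SearchOperand:
    term: str

    def up(self, levels=1):
        """Query expansion: move towards broader terms (aims at higher recall)."""
        term = self.term
        for _ in range(levels):
            term = PARENT.get(term, term)  # stay put at the top of the hierarchy
        return SearchOperand(term)

    def down(self):
        """Query reduction: list the narrower terms offered as replacements."""
        return [SearchOperand(c) for c, p in PARENT.items() if p == self.term]


def replace_operand(macro, old, new):
    """Return a modified search macro in which one operand is replaced."""
    return [new if op == old else op for op in macro]


macro = [SearchOperand("company"), SearchOperand("law")]
expanded = replace_operand(macro, SearchOperand("company"), SearchOperand("company").up())
```

Here the macro is modelled as a plain list of operands; the conjunction and/or disjunction structure of a real search macro is left out for brevity.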

The upper layer of the predefined set of grammar based search macros is directed towards the component APO Triplet (part of Theme Frame). When a user has explored the results from activating these search macros, she can then activate components in a system of more ‘specific grammar based search macros’. These are regular expressions, preferably with names that give a meaningful signal to the user. The option has some resemblance to traditional KWIC indexes; however, the expressions are made available on top of more forceful grammar based reduction devices as explained in the section ‘Zonation Criteria’. When these are combined properly and according to specific needs in a user community, and, not least, given names that signal their characteristic features, the user will have a forceful and sophisticated exploration tool at hand.

An interpretative layer of search macros is founded on issues related to discourse described in the section ‘Zonation Criteria’. Within text, there is a kind of superordinate communicative function and it is possible to identify cue phrases (or lead functions) for portions of the text. For instance, when reading a text, the reader experiences that segments concern ‘a certain actor expressing opinions about ‘something’ considered to be a Problem’, ‘an actor argues against proposed solutions’, ‘solution proposals are evaluated or compared’. These lead functions are discovered during an interaction between the reader and the text (the text being a delegate on behalf of some author). However, in highly structured text from professional authors (and markedly within some professional domains, for instance law), the text contains structural signals as well as lexical signals that mark out some sections in the document. In order to identify and encode these text portions, it is necessary to record phrases (word constellations) that signal lead functions. Cue phrases, that is, phrases with lexical signals (words) indicating some aspects of the thematic matter dealt with in a text span (sentence, zones), may be registered in a separate facet (cue filter). However, lexical signals to, for example, a problem may be explicit or implicit, in the latter case for instance expressed as negative evaluations of the situation described (including a negative evaluation of a proposed or selected solution). The establishment of such contacts is thus considered to be of a semantic-pragmatic nature, and an exhaustive encoding will by necessity require human intervention/validation. If this is of interest in a user community (balance between cost and performance), cue phrases that have a high score from validation procedures will be included in ‘Type: Filter Cue Phrase’. These filters will vary with respect to document genre (laws, reports, etc).

When encoded, a search macro giving the user options for selecting a facet as filter will retrieve these text units (the address of all types of units may be derived from the documental logical object type DLOT Identifier).

Since the main search macros operate on a rather limited set of grammatical tags, they will not avoid the ambiguities in the text. However, compared to traditional free text searches (even with neatly designed interfaces and user support), the present embodiment of the invention shows that filtering based on quite simple regular expressions is promising. When realised in full scale, this set of techniques has a prospect of interest to users within various types of organisations. Different user communities must preferably be supported by tailored search macros based on the combination of grammar based search operands (rather static) and semantic search operands (dynamic/evolving). The arrangement of search macros can be tuned to serve typical information needs within a ‘user community’. The question is what could be seen as a minimal and necessary set of search macros and what is the ‘best’ way of arranging these in layers. The present embodiment of the invention will preferably be transferred to an experimental setting where user representatives within a certain domain will provide feedback in the process where the present embodiment of the invention is to be converted into a robust technological platform. Representatives from the chosen user community will be exposed to different sets of grammar based search macros, filter options, interface design, etc. By interviewing the representatives, the goal is to identify how the components should be interlinked in a detailed design in order to maximise the system's potential exploratory capacity.

Overview of Functions and Groups of Constructions

The descriptive overview of basic building blocks, functions, constructions and modus operandi is given in a simplified version of the meta-language BNF.

There are three important, yet simple and basic functions that operate on the database partitions (DBP). These are:

-   Reduction Function (ReFun): Denotes the basic set of functions that reduces the files/tables according to reduction criteria (see section ‘Apparatus Zonation’). The reduction functions detect matches between external values (given/selected in the text sounding board) and all internal values (stored values in the set of DBP), thus producing a set of entries denoted as Logical Access Points (LAP) that are further processed in the Extraction Function.
-   Extraction Function (ExFun): Denotes the set of functions that extracts one current text or part of text (one or several sentences, one or several words, etc) as specified in the text sounding board.
-   Attention Function (AtFun): Denotes the set of functions or compositions of ReFun+ExFun. The compositions produce text-driven attention structures in the more advanced Modus Operandi.

<Text Sounding>::= ( <AtFun> ) | ( <ExFun> )
<AtFun>::= <opecal> ( <ExFun> ) | <opebol> ( <ExFun> )
<ExFun>::= <atot> | <atot> ( <ReFun> )

Object Types and Attribute Types (DLOT and ATOT)

The operations (denoted as building blocks, functions, and groups of constructions) operate on the DBPs (database partitions), which are organised as layered levels of information, each subordinate level consolidated in the one above, about the object types that are pre-processed by the apparatuses for acquisition, segmentation, disambiguation and zonation.

The main concept is that of Documental Logical Object Types, abbreviated as DLOT. These object types include all physical or derived objects contained in a document. The notion of Object Type covers all types of media: audio, video, pictures, text, etc. The present invention preferably focuses on the object type Text.

A Word is the smallest textual unit. A Word is a DLOT at the lowest level in a hierarchic structure. Each Word is part of a Sentence, and a Sentence may be part of derived DLOTs, denoted as text zones. If the texts' structure is properly annotated with XML, the sentences may be treated as objects within common structural units such as paragraphs, sections, etc. A Sentence is a part of a Text (though a specific text may include only one sentence), and the Text is in turn a part of a Document.

The list below shows the hierarchical arrangement of these DLOTs:

-   WITHIN (Word, Sentence),
-   WITHIN (Sentence, Zone),
-   WITHIN (Sentence, Text),
-   WITHIN (Zone, Text),
-   WITHIN (Text, Document),
-   WITHIN (Document, Collection).
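As a minimal sketch, the WITHIN pairs listed above can be read as direct containment relations from which indirect containment (for instance a Word within a Collection) follows by transitivity. The function name within and the traversal strategy are assumptions for illustration; the source does not prescribe how the relation is evaluated.

```python
# Direct containment pairs copied from the WITHIN list above.
WITHIN_PAIRS = {
    ("Word", "Sentence"),
    ("Sentence", "Zone"),
    ("Sentence", "Text"),
    ("Zone", "Text"),
    ("Text", "Document"),
    ("Document", "Collection"),
}


def within(inner, outer):
    """True if `inner` is contained in `outer`, directly or via intermediate DLOTs."""
    frontier, seen = [inner], set()
    while frontier:
        current = frontier.pop()
        for child, parent in WITHIN_PAIRS:
            if child == current and parent not in seen:
                if parent == outer:
                    return True
                seen.add(parent)
                frontier.append(parent)
    return False
```

For example, within("Word", "Collection") holds through the chain Word, Sentence, Text, Document, Collection, whereas within("Zone", "Sentence") does not.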

A Zone is a derived Documental Logical Object Type (DLOT) and is based on the calculation of feature similarities between pairs of sentences throughout the text (see ‘Apparatus Zonation’). Each DLOT has a series of Attribute Types attached. These named attribute types designate the data resulting from various types of text processing, for instance average sentence length, average word length, the ‘reading’ of each word, the lemma of each word, the lemma's frequency within a text and within a collection of texts, the word's grammatical class, and so on. The apparatuses operate on the internal values assigned to these sets of attribute types and transmit derived values that are assigned to other attribute types.

At the generic level the attribute types are denoted ATOT, Attribute Type attached to Object Type. The definitions of the building blocks, functions, etc, refer to ATOT, which may be considered as an ‘open generic operand’ that can be replaced by its specific kinds. Thus, DATOT denotes the attribute types attached to Document: (<is-a> (Document, DLOT)), (<attached-to> (DATOT, Document)), (<is-a> (DATOT, ATOT)). The attribute types attached to the object type Word are denoted WATOT, thus <is-a> (WATOT, ATOT), <is-a> (SATOT, ATOT), and so on. SATOT refers to sentences, ZATOT refers to zones, and TATOT refers to text.

Each set of DLOT and ATOT and all operations performed on them are documented in the present invention's preferred Information Resource Management System (IRMS), designed according to the general guidelines specified in the ISO standard for IRDS (1986).

Reduction Functions

The Reduction Functions (ReFun) operate on the database partitions containing data associated with the ATOTs. The present invention applies a specially designed XML file format, at present stored and managed in an RDBMS known in the prior art. The files/tables are annotated and organised in multiple levels (MAFS = Multi-Levelled Annotated File System).

The reduction is based on a set of criteria, and these criteria are specified as belonging to certain types (see ‘Zonation Criteria’). In the following description, a reference to ATOT means any kind of Attribute Type attached to any of the Documental Logical Object Types stored in and managed by the system.

ReFun is defined as recursive, which means that the function may be activated several times in a ‘nested’ fashion, where the intermediate results are stored in intermediate files, and/or stored as persistent files that are consolidated or ‘pushed upwards’ in the MAFS, and/or possibly end up for closure in the database partitions (DBP).

In the present invention ReFun is defined completely given that:

The system manages all attribute types available for further processing. Some of the attribute types together with their value sets are displayed in specially designed windowpanes in order for the user to select/activate displayed attribute types and/or value sets.

The internal values (those values that are actually stored in MAFS or DBP) and selected/activated by the user (or via a recursive call to ReFun) are denoted as VALUE (abbreviation VAL).

A set of operators known in the prior art, which at the generic level are denoted OPER (search OPERand). A Search Operator <is part of> Building Block. The definition of ReFun is therefore, basically, a general grammar describing allowable types of patterns for search expressions. The table below gives an overview of the different types of operators, classified in two groups depending on the number of operands they operate on (MONO = a single operand, and DUO = two operands).

TABLE 5

<ReFun>::=

-   <operel> (<atot>, <val>) |
    -   <operel> includes the relational predicates (and thus is a type of <opeduo>, which operates on two operands). All the different types of ATOTs are managed in separate DBPs.
    -   ATOT -> WATOT -> {Wo-Id, Wo-GC, Wo-lemma, . . . }
    -   If the current windowpane in the text sounding board is on the DLOT ‘Word’, this will constrain the set of ATOTs available for reduction (or filtering), in this case to the set of attribute types denoted as WATOT.
    -   Note: The non-terminal <val> includes the non-terminal <ExFun> as one of its defined elements.
    -   Example: EQ (Wo-GC, ‘NOUN’) will generate a list of all the word identifiers (entries) of words tagged as nouns. EQ (Wo-lemma, ‘STATOIL’) will generate a list of all the word identifiers for the words tagged to have the lemma ‘Statoil’.
-   <opeduo> (<ReFun>, <ReFun>) |
    -   This operator binds together the results from two or several (recursive) activations of ReFun. Each activation returns one or several entries (logical access points (LAP)), commonly identifiers (internal values) or derived values, for instance pairs of identifiers, transmitted for further processing.
    -   Example: AND ((EQ (Wo-GC, ‘NOUN’)), (EQ (Wo-Sem-Cl, ‘ACTOR’))) will generate a list of word identifiers of words that are tagged as nouns and classified as belonging to the semantic class Actor.
    -   These simple patterns may be combined into complex patterns due to the recursive definition of ReFun. In order to use proximity operators or operators for enclosure, the intermediate result (word identifiers or other types of identifiers) must be assigned to a temporary file before a new activation of ReFun.
-   <openeg> (<ReFun>)
    -   The negation operator may be applied on the result (entries) returned from a call to ReFun.
    -   Example: NOT (EQ (Wo-GC, ‘NOUN’)) will generate a list of all the word identifiers (entries) of words not tagged as nouns (that is, verbs, adjectives, etc).

Extraction Function (ExFun)

This set of functions locates and extracts internal values from the DBPs containing data about the ‘object layers’, the set of DLOTs {Document, Text, Zone, Sentence, Word}. The adjusted DBMS keeps track of all the attribute types and internal and derived values attached to Document (DATOT) and all the attribute types and internal and derived values attached to the texts extracted from documents (TATOT).

In order for the user to receive a response she has to select one of the values displayed in the text sounding board. The displayed values are internal values that are extracted in advance from the underlying text (the difference between external values, i.e. values that are selected/activated by the user, and internal values is blurred). The user is also given options for formulating her own request (external values given as a free text search expression), in which case there is no guarantee of a match against internal values (traditional IR options).

The Extraction Functions operate on current texts. The current texts are texts that are available for exploration once the user has selected one or several texts from a displayed list. The current texts are previously pre-processed by the present invention's apparatuses (acquisition, segmentation, disambiguation, zonation).

When the texts have completed the pre-processing stages, they are displayed in the mode ‘Plain Modus Operandi’. In this mode, the present invention preferably gives the user options for activating series of Reduction Functions (ReFun). The idea founding the outline of Modus Operandi in the present embodiment of the invention is that the user in an ‘incremental fashion’ gets more and more advanced tools at hand.

In order for ExFun to operate adequately (for instance with respect to processing speed), it may be preferred that ExFun first activates series of ReFun in order to reduce the amount of ‘current’ ATOTs transmitted into further processing. The performance issues will preferably be handled via parallel processing. The ExFun rules therefore include a recursive call to ReFun that can operate on all the ATOTs' internal values. The reduced set of ‘current ATOTs’ is stored in temporary files (or persistent files if the ATOTs in question are involved in (input to) frequent Extraction and Reduction functions). That is, in principle there is no difference between ExFun and ReFun. They are a sort of twins; however, ExFun locates and extracts internal values and stores them in separate, temporary files designed for efficient processing.
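The relationship between the two functions can be sketched under some simplifying assumptions: the WATOT partition is modelled as an in-memory list of records, reduction expressions are nested tuples, and the names refun and exfun as well as the sample data are hypothetical. Temporary files, MAFS levels and parallel processing are not modelled.

```python
import operator

# Hypothetical WATOT partition: one record per word.
WATOT = [
    {"Wo-Id": 1, "Wo-GC": "NOUN", "Wo-lemma": "statoil", "Wo-Sem-Cl": "ACTOR"},
    {"Wo-Id": 2, "Wo-GC": "VERB", "Wo-lemma": "sell", "Wo-Sem-Cl": None},
    {"Wo-Id": 3, "Wo-GC": "NOUN", "Wo-lemma": "share", "Wo-Sem-Cl": "ECONOMY"},
]

# Relational predicates (<operel>).
OPEREL = {"EQ": operator.eq, "NE": operator.ne, "GT": operator.gt,
          "GE": operator.ge, "LT": operator.lt, "LE": operator.le}


def refun(expr, table=WATOT):
    """Evaluate a nested reduction expression; return word identifiers (LAPs)."""
    op, *args = expr
    if op in OPEREL:  # <operel> (<atot>, <val>)
        column, value = args
        # Assumption: records without a value for the column never match.
        return {r["Wo-Id"] for r in table
                if r[column] is not None and OPEREL[op](r[column], value)}
    if op == "AND":
        return refun(args[0], table) & refun(args[1], table)
    if op == "OR":
        return refun(args[0], table) | refun(args[1], table)
    if op == "NOT":
        return {r["Wo-Id"] for r in table} - refun(args[0], table)
    raise ValueError(f"unknown operator: {op}")


def exfun(column, reduction=None, table=WATOT):
    """Extract (identifier, value) pairs, optionally after a ReFun reduction."""
    ids = refun(reduction, table) if reduction else {r["Wo-Id"] for r in table}
    return [(r["Wo-Id"], r[column]) for r in table if r["Wo-Id"] in ids]


actors = exfun("Wo-lemma", ("AND", ("EQ", "Wo-GC", "NOUN"), ("EQ", "Wo-Sem-Cl", "ACTOR")))
```

In this sketch exfun differs from refun only in that it returns values alongside identifiers, which mirrors the remark that the two functions are a sort of twins.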

A possible Reduction Function that is preferred in all Modus Operandi is the reduction to certain zones of the texts (a reduction that affects the following selections). Thereafter the user may focus on the words within these zones, in particular locating zones containing a value set for ‘Problem Indicators’ (reduction). This either activates an existing chain or triggers the generation of a new chain (depending on the internal value selected). Another typical reduction may be to select nouns in the subject position where EQ(Word GC, ‘subject’). It is not expected that the users are able to formulate or select and understand the effects of grammatical codes. Therefore internal values (ATOT and values) preferably must be transformed into ‘understandable’ value sets. These issues are related to HCI factors.

TABLE 6

<ExFun>::=

-   <atot> |
    -   Example: Extract all word lemma (Wo-lemma), which in fact is an application of ReFun: EQ (Wo-lemma, ‘open operand’).
    -   Minimum = 1, Maximum = constraint given by system and included as restriction in ReFun (see below). Other types of user given maximum values are part of the set of functions within ReFun.
-   <atot> ( <ReFun> )
    -   That is, the ExFun calls ReFun in order to reduce the set of internal values displayed in a windowpane or reduce one text by for example selecting certain zones in one of the current texts.
    -   Zone-ID (AND ((EQ (Wo-lemma, ‘LAW’)), (GT (Sentence Density, 1))))
    -   Current DLOT is Zone. Extract zones where there are registered occurrences of the word lemma ‘law’, and highlight those sentences where the particular word occurs more than once. The first part of the ReFun operates on the set WATOT, and the second part operates on SATOT.
    -   The Sentence Density is calculated as for Zone Density, i.e. a sentence is considered to be a zone in itself (in particular for short texts). The density value will reflect multiple occurrences of the same word within one sentence. The Sentence Weight reflects the ‘closeness’ of words in a Chain.
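The zone extraction example in Table 6 can be sketched as follows. The sample sentences and the function name zones_with_lemma are hypothetical, and sentence density is approximated here by the number of occurrences of the lemma within a sentence; the exact density formula is not given in this section, so this is an assumption.

```python
from collections import Counter

# Hypothetical data: (zone id, sentence id, lemmas in the sentence).
SENTENCES = [
    ("zo-01", "se-01", ["law", "regulate", "gas", "law"]),
    ("zo-01", "se-02", ["company", "own", "share"]),
    ("zo-02", "se-03", ["law", "apply"]),
    ("zo-03", "se-04", ["gas", "transport"]),
]


def zones_with_lemma(lemma, min_density=1):
    """Return zones containing `lemma` and, per zone, the sentences to highlight.

    Assumption: sentence density is approximated by the number of occurrences
    of the lemma in the sentence.
    """
    result = {}
    for zone_id, sentence_id, lemmas in SENTENCES:
        count = Counter(lemmas)[lemma]
        if count == 0:
            continue  # sentence does not contribute to the reduction
        highlights = result.setdefault(zone_id, [])
        if count > min_density:  # corresponds to GT (Sentence Density, 1)
            highlights.append(sentence_id)
    return result


law_zones = zones_with_lemma("law")
```

Zones without any occurrence of the lemma are dropped, while zones with only single occurrences are kept with an empty highlight list.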

Attention Function (AtFun)

This set of functions is related to the composition of new objects (derived objects) or the composition of constellations of contacts displayed in the text sounding board.

AtFun produces derived values. The derived values are based on external, internal, or previously derived values, along with activations of functions in the sets ReFun and ExFun. That is, AtFun operates on intermediate results returned from several (recursive) activations of ReFun (location, extraction and composition of temporary files/tables are considered as a part of the rule described under ExFun).
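A minimal sketch of how derived values might be produced from an interim extraction result is given below. The operator names follow the <opecal> and <opebol> non-terminals of the grammar; the function names atfun and exist and the sample zone weights are hypothetical.

```python
from statistics import mean

# Quantifiers (<opecal>) applied to values returned by an extraction.
OPECAL = {"AVG": mean, "SUM": sum, "MAX": max, "MIN": min, "NUMBER": len}


def exist(extracted):
    """<opebol>: TRUE if the derived set is non-empty, FALSE otherwise."""
    return len(extracted) > 0


def atfun(operator_name, extracted):
    """Apply a quantifier or EXIST to the interim result of an ExFun call."""
    if operator_name == "EXIST":
        return exist(extracted)
    return OPECAL[operator_name](extracted)


# Hypothetical interim result: zone weights extracted for the current text.
zone_weights = [0.4, 0.9, 0.7]
```

A caller could use atfun("EXIST", ...) to decide whether to display results or to fall back to another DLOT level, and atfun("MAX", ...) to rank them.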

TABLE 7

<AtFun>::=

-   <opecal> ( <ExFun> ) |
    -   ExFun is specified with the option for activating ReFun recursively. Before anything is actually displayed or designated by pointers, the interim results may have to be sorted or processed by using a quantifier of some sort (max, min, number).
-   <opebol> ( <ExFun> )
    -   When the ExFun is followed by one or several ReFun activations, there will either be ‘something’ in the interim or temporary files or these files will be empty. The function transmits a value telling whether the ‘derived set is empty’ as a result to a previous function (either as activated by the user operating the text sounding board or resulting from the application of any of the recursively activated internal reduction functions).
    -   The operator EXIST returns the values TRUE or FALSE. For instance: TRUE for Zones containing a specific set of Nouns. In case the value FALSE is returned, the system could display a message like ‘Sorry, there are no zones containing these nouns, but several sentences quite close to each other contain these words. What about having a look at these sentences? If you would prefer to do so, push the button ‘Do-It’.’
    -   This suggests an inclusion of the value OTHERWISE, or a new operator triggering the display of an appropriate message, succeeding the initial return FALSE which triggers a search in the DLOTs at a lower or higher level of the current DLOT. This follows the notion of an ‘inside out’ exploration and navigation strategy. The triggers start with the densest areas (zones) and move outwards to sentence level and then finally word level. The triggers compute an alternative to what the user demands through her selections in the text sounding board.

Details ATOT and Attention Structure

The concept ATOT denotes a set of attribute types, the set varying according to the Documental Logical Object Type (DLOT) the ATOT is attached to. There are five basic sets of ATOT {DATOT, TATOT, ZATOT, SATOT, WATOT}. It is further preferred to combine internal values assigned to attribute types in the different sets. These sets of attribute types form temporary files (representing derived objects, as for example Chain, Bond, Traversal Path). The set of attribute types held in an ATOT is arranged as interconnected tables (common key propagation). The type ATOT can therefore be further defined (or decomposed) in the production rule <atot> as follows:

TABLE 8 <atot>::= <atot-name> | The name assigned to one of theattribute types managed in a set denoted as an ATOT. This is an internalvalue at the type level. The atot-name is the named sets of attributetypes attached to one of the DLOTs, {DATOT, TATOT, ZATOT, SATOT, WATOT}.The atot-name may also be a name assigned to a temporary fileconstructed by the system during processing (storing and managingtemporary data). These files contain a collection of internal valuesextracted from one of the ‘permanent’ ATOTs. The files (and theircolumns), follow standard naming conventions documented in the IRMS.Example: WATOT, will activate the currency indicator to the file/tablegiven this name. The text sounding board preferably will providewindowpanes from which the user can select and activate the variousATOTs and the internal values assigned to some of the attribute types(some types of internal values are considered to be of less importanceto an end- user). Constraint: The user has to select one or severaltexts as current text(s) before they can focus on other types of DLOTs(Zones, Sentence, Words). <atot-col [,atot-col]> | is part of the ruleis a less constrained variant of the rule below. The system (via themeta data file) keeps track of which ATOT a certain named attribute typebelongs to (each attribute type has assigned a unique name, followingthe naming convention of prefixing the attribute type with the first toletters of the DLOT in question (Te-, Zo-, Se-, Wo-). (Wo-CG, Wo-SemC1)list pairs of the words' grammatical class and values for semanticclasses {(noun, actor), (noun, economy), (noun, utterance), . . . }. Anyrequest for the extraction of atot- columns that does not include anidentifier, will be default be extended with the identifier assigned tothe set of ATOT in question (the identifiers are the entry points forextraction, display, further processing). The value pairs (ID, ‘value’)are consolidated and sorted (and/or ranked) before display. 
The user isgiven the option to select one or several values from the displayed set.Example: if the user selects/activates ‘actor’, all the words classifiedas being members of this semantic class will be highlighted(marked/visualised in the attention structures). All values in thecorresponding chain unfolded in a list pane and highlighted in the textpane. If the call involves the extraction of internal values assigned toattribute types in different ATOTs, the composition/assembly procedurewill be constrained by the key propagation principle. (Values areextracted from atot-columns with a default inclusion of identifier).Example: (Se-Id, Wo-SemClass) List pairs of sentence-id and the word'ssemantic class, {(se-01, actor), (se-01, adj-pos), (se-02, economy),(se-02, adj-neg) . . . } The Dublin Core Element Set (DCES) is a part ofthe set DATOT (Attribute Types attached to the Object Type ‘Document’).The identifier propagation system anchors an inner layer DLOT to anouter layer DLOT. By applying this rule, a current text will be linkedto its DCES. If the user selects one or several attribute types in theDCES like (doc-type, doc-producer, se- number), the system will respondby listing the set of document types {stortingsmelding,stortingsproposisjon, lov . . . }, and producers {OED, Odelsting}. Thesets of values can be further restricted by applying a ReductionFunction, for instance by operating on one of the attribute types in theset ZATOT, (LT (Ze-number, 50)).Each atot-col has assigned a ‘displayname’ (data kept in the meta data file) Example: Wo-GC is the name ofthe attribute type in the set WATOT denoting the Word's grammaticalClass. Preferably HCI factors provide guidelines for the naming (areplacement of Wo-GC with Word Grammatical Class is not a preferrednaming system. 
<atot-name ( <atot-col [, atot-col]> | During theapparatuses processing, several temporal files/tables will be generated.Data about these temporary files that in some occasions will begenerated according to previous user requests will not exist in thesystem's meta data files. They are kept in separate files and if notedas successful preferably are pushed upwards to the User Profile. Thesetemporary files/tables may be considered s derived ATOTs, that is, a setof attribute types generated during processing and with value setsresulting from various types of calculations. The specific atot-namewill be generated by the system according to preferred namingconventions. For instance, if the derived ATOT is based on calculationsperformed on attribute sets for Zones and Sentences, the derivedatot-name should reflect this in the name's prefix. Similar namingconventions apply for the set of derived atot-columns (containing thederived value sets). Example: Wo-Zo-Temp 1 (Zone-traversal-path, zoneorder). The user may request for detailed information about text zones,(in fact initially a derived DLOT). When the user activates the Zonebutton, a pop-up menu informs the user about traversal paths acrosszones that include a set of current search operands (the WITHIN Zoneoperation). The traversal path is derived from data extracted from theZone Link Set. For instance, the user's current request may be like {OR(OR ((Statoil AND Aksje), (SDØE AND Eier)), (Gass AND Transport))}. Thisnecessitates processing of the underlying files/tables ZATOT and WATOT.When the results in the form of traversal paths are displayed, the usercan select/activate one of these paths and at the same time be given anoption to select Zone Order. That is, in what order she wants totraverse the zones containing one or all pairs of word-values. 
The valueset for Zone Order can for instance be {by appearance, zone size, zoneweight, inside out}., the latter triggering a ranking from those zoneswith highest density values and outwards. <opeari> (<atot-col>, <val>)|The dyadic ‘arithmetic operators’ are used in order to calculate ‘new’values according to the user requests. There are several attribute typesin the various sets of ATOT that include numeric value sets (size, awhole range of frequency measures, for instance the weights, densities,length, frequency, etc.). Note that the element <val> is defined asincluding ExFun. The new, derived value sets are stored in temporaryfiles/tables. In a similar way as above, a user request may involve aspecific grammatical pattern, as for instance (FOLLOWED-BY (EQ (Wo-GC,adjective)), (EQ (Wo-GC, adjective)). The operator FOLLOWED-BY isconstrained to operate against words in one sentence at the time. Whencomposing an operation like this (putting together basic buildingblocks), data about the words' relative position within the sentences isneeded. Data about positions are used in an arithmetic operation inorder to calculate the distance between the two adjectives. The resultmay then be displayed in order of decreasing distance (density). Thistype of derived data is stored in a temporary file named like‘Wo-GC-distance’. If the derived set of attribute types held in thetemporary file also includes the words' lemma, and possiblysemantic-class, it will be possible to apply AtFun. The construction ofthese specific types of requests (by composing building blocks andgenerating temporary files) is a matter of detailed design. Instead ofstoring information about the words' semantic class in the set ofattribute types WATOT, it is preferred to follow the guidelines givenfor TWS-procedure (Target Word Selection). 
The procedure runs from the outside in (from known thesaurus entries onto words belonging to a certain grammatical class), followed by inside out (from values in the set WATOT, nearby word types in case of a match after outside-in, and onto entries in a thesaurus). Requests based on simple grammatical patterns may further be constrained to consider only word constellations in specific sentence types (part of the set of attribute types in SATOT) and/or zone types classified according to the notion of Discourse Elements. The user may receive a marked display of nouns in the subject position within these sentences, with the cue phrases highlighted in the other sentences within the zone. (Sentences may be classified as important because they contain ‘important words’, such as actors considered important to the user group in question, and so forth.) This particular example may preferably be positioned under the ‘Commix Modus Operandi’. These examples are meant to explain the principles underlying ReFun and to show that ReFun operating on a set ATOT can be applied in the same manner on all layers {Document, Text, Zone, Sentence, Word}.

Basic Components

Operators

The operators are known in the prior art. Because they are used in the previous section, they are set out in the table below.

TABLE 9

<opebol> ::= EXIST
    Operator returning the values 1 or 0 (TRUE or FALSE).
<openeg> ::= NOT
    Negation.
<operel> ::= EQ | GT | GE | LT | LE | NE
<opeari> ::= PLUS | MINUS | PROD | DIV
<opecal> ::= AVG | SUM | MAX | MIN | NUMBER | other quantifiers
    This set of operators covers quantifiers, which are prefixed operators that bind the variables in a logical formula by specifying their quantity (Webster 1996).
<opeclo> ::= WITHIN | OVERLAP | ENCLOSING | other closure operators
    WITHIN is a kind of co-occurrence operator and operates on an ‘inner object’ to see if this ‘inner object’ is part of an ‘outer object’.
<opedis> ::= PRECEDES | FOLLOWS | GREP | other distance or proximity operators
<opelog> ::= AND | OR | XOR
    Binds together two or several Reduction Functions.
<openav> ::= UP | DOWN | NEXT | PREVIOUS | SUB | SIB
    Navigational operators (somewhat superfluous since they partly overlap with other operator types). The system generates chains stored in a DBP Information Chain that supports navigational operations.

Values

The apparatuses, like any other text processing system known in the prior art, operate on three different forms of values. The values all exist in the DBP but differ with respect to how they came into existence and how they are processed. The values are at different abstraction layers. The concept ‘values and their types’ is not to be confused with ‘value groups’. The short description below explains the present invention's preferred name ‘contacts’, which refers to the content of the panes displayed in the text sounding board.

Internal Values

The word internal actually means ‘existing or situated within the limits or surface of something’. MAFS and DBPs are ‘the limits’. Internal values include all values stored. That is, ATOTs and their named columns are internal values (type level), as are the values (occurrence level) stored in each column. They include all data types such as numerical, alphabetical, string, tags, pointers, binary, labels, etc.

External Values

External values are the values coming in from the outside, that is, values entering the system via the interface options. These values are given or selected by the user during her interactions with the underlying text via the text sounding board. If the user selects values extracted by devices embodied in the present invention, which are displayed for selection in one of the window panes in the text sounding board, these values will coincide with internal values. This is the reason why the concept of ‘contacts’ was introduced (the values are access points or contacts to the underlying text).

The contacts, or external values selected, are not ‘terms’ in the sense in which that concept is used in traditional information retrieval systems; they coincide with internal values. Values given by the user by chance (more like a free text search option) need not, of course, coincide with the internal values, so it is worthwhile distinguishing between these two types of external values: contacts and terms. This distinction in fact blurs the boundary between internal values and external values.
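By way of a non-limiting illustration, the distinction between contacts and terms may be sketched as follows. The function name `classify_external_values`, the sample values and the case-insensitive comparison are assumptions made only for this sketch; they are not prescribed by the present description.

```python
def classify_external_values(external_values, internal_values):
    """Split user-supplied values into contacts and terms.

    A contact coincides with a stored (internal) value; a term does not.
    Case-insensitive matching is an assumption of this sketch.
    """
    known = {value.casefold() for value in internal_values}
    contacts = [v for v in external_values if v.casefold() in known]
    terms = [v for v in external_values if v.casefold() not in known]
    return contacts, terms


# Hypothetical internal values, as they might be held in a DBP.
internal = ["Statoil", "Aksje", "Gass", "Transport"]
contacts, terms = classify_external_values(["gass", "pipeline"], internal)
print(contacts)  # ['gass']
print(terms)     # ['pipeline']
```

Values returned as contacts can be resolved directly against the stored index entries, whereas terms would have to be handled like an ordinary free text search.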

Derived Values

Derived values are values that result from applying functions on internal values and/or external values. For instance, the set of Sentence Identifiers constituting a Zone Border is a set of derived values. Concatenated nouns in Norwegian, such as ‘oljeselskap’ decomposed into the two nouns (values) ‘olje’ and ‘selskap’, and unfolded in Fan Structures, are also derived values (even if the components ‘olje’ and ‘selskap’ also exist as internal values). If these components are linked by using the two link types <is a> and <aspect of>, these links (a set composed of the source-id and target-id) are likewise derived values.
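The decomposition of a concatenated noun into derived values may be illustrated by the following minimal sketch. The splitting strategy (first split point at which both parts are known nouns), the function names and the direction assigned to the two link types are assumptions made for illustration only; the description does not fix them, and linking elements inside Norwegian compounds are ignored here.

```python
def decompose_compound(word, known_nouns):
    """Return (modifier, head) if the word splits into two known nouns, else None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known_nouns and right in known_nouns:
            return left, right
    return None


def derive_links(word, parts):
    """Build (source, link type, target) triples for a decomposed compound.

    Assumed direction: the compound <is a> kind of its head noun, and the
    modifier is an <aspect of> the compound.
    """
    modifier, head = parts
    return [(word, "is a", head), (modifier, "aspect of", word)]


known_nouns = {"olje", "selskap"}  # components that also exist as internal values
parts = decompose_compound("oljeselskap", known_nouns)
print(parts)                               # ('olje', 'selskap')
print(derive_links("oljeselskap", parts))
```

Both the component pair and the link triples are derived values in the sense used above, even though each component is already stored as an internal value.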

The triplets of contacts (‘triple tracks’) constitute a particular set of derived values (in that the set is based on recursive applications of ReFun and ExFun). The distinction between internal values and derived values is not clear-cut, since several derived values are consolidated and pushed upwards to be stored as persistent in DBPs. As soon as a derived value is actually stored in one of the temporary files comprising MAFS and DBPs, the value changes its status from derived to internal. The reason for differentiating between them is the need for denoting ‘things’ when specifying what the functions operate on.

The recursive definition of VALUE makes it possible to connect the various functions: ExFun, ReFun (element in ExFun) and AtFun (with ExFun as one of its elements).
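The recursion may be illustrated for the `<expression>` rule of Table 10, in which a constant is either a plain value or an arithmetic operation on two constants. The following is a minimal sketch only: the nested-tuple representation and the name `eval_expression` are assumptions of the sketch, while the operator names PLUS, MINUS, PROD and DIV are those of Table 9.

```python
import operator

# Arithmetic operators of Table 9 mapped onto Python functions.
OPEARI = {
    "PLUS": operator.add,
    "MINUS": operator.sub,
    "PROD": operator.mul,
    "DIV": operator.truediv,
}


def eval_expression(expr):
    """Evaluate a constant or a nested (operator, left, right) tuple."""
    if isinstance(expr, (int, float)):
        return expr  # a plain constant, e.g. an external value such as 250
    op_name, left, right = expr
    return OPEARI[op_name](eval_expression(left), eval_expression(right))


# PLUS(250, PROD(2, 10)): the second operand is itself an expression.
print(eval_expression(("PLUS", 250, ("PROD", 2, 10))))  # 270
```

The result of such an evaluation is a derived value, which could then be passed on as the `<val>` operand of another function.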

TABLE 10

<val> ::=
  <string> |
      Alphanumerical value, for instance ‘government’.
  <expression> |
      A constant, either an internal or an external value, or an arithmetic operation on two constants (derived value). An expression may be used to construct values for navigation, that is, the identifier for ‘whatever’ is next, previous, up or down in a temporary file. See <openav> in the section on operators.
  <ExFun> |
      The inclusion of an option for activating ExFun explicitly states the need for capturing intermediate internal values in a composite construction.
  <element> |
      Any kind of element defined as operations performed on results transmitted from ExFun, which in part may call a ReFun (see below).
  <all>
      Refers to all values stored as associated with an ATOT (all the attribute types) or all the values stored as associated with an ATOT-column.
<expression> ::= <const> | <opeari> (<const>, <const>)
      const is an external value such as 250, or an arithmetic operation performed on two constants, recursively.
<element> ::= <opecal> (<ExFun>) | <opeclo> (<ExFun>) | <opedis> (<ExFun>)
      An element is defined as the application of three types of operations performed on the result from an Extraction Function, in particular for the intersection of chains, enclosure and distance. For example, calculate the density for results (bundles of sentences or subsets of pre-defined zones) transmitted by ExFun and display them as enclosed in zones with another feature set, and only zones with a particular distance in between.
<all> ::= <val>

Modus Operandi in More Detail

The table below shows the preferred names for the ‘modus operandi’ and their corresponding non-terminal codes, and describes each modus operandi in more detail.

TABLE 11

<Modus Operandi> ::= (preliminary names)
  <OPIMOD-1> |  Plain Modus Operandi
  <OPIMOD-2> |  Crafted Modus Operandi (qualified)
  <OPIMOD-3> |  Quizzed Modus Operandi (multi-qualified)
  <OPIMOD-4> |  Commix Modus Operandi
  <OPIMOD-5> |  Virtuous Modus Operandi
  other Modus Operandi  Nameless Modus Operandi
<OPIMOD-1> ::= (<DLOT>)
      Gives the user an overview of available exploration and navigation facilities. The DLOTs are Text, Zone, Sentence and Word.
<OPIMOD-2> ::= (<DLOT> (<OPEREL> (<ATOT>, <VAL>)))
      The user has selected one of the DLOTs and decided to open a pane to see what kind of information is registered for each of them. She has, for instance, opened the DLOT Text and becomes aware of the interlinked DLOT Word and the attribute types attached to this DLOT. She prefers to explore the attribute type ‘Word Focus’, which unfolds a pane with the values stored for this attribute type. She may then browse through the set of internal values and preferably select one of them. The word occurrences are immediately highlighted in the text pane. The option for enclosure (within default zone) can be activated, and the word occurrences reduced and displayed accordingly. The pane displaying the lists of focused words can further be reduced to the subset occurring within zones.
<OPIMOD-3> ::= (<ATOT> (<OPELOG> (<ReFun>)(<ReFun>)))
      At this level one can expect that the user has selected the DLOT (for instance a set of particular texts) and has decided to go for a deeper investigation of another type of DLOT - the words and word constellations in the text. This means that some kinds of tracks (chain information) must preferably be displayed in the text sounding board. The tracks are the Zone Sensors, which preferably must be given comprehensive names. For instance, the user prefers to intersect the chains for ‘important actor’ (a reduction function applied on the chain covering semantic codes) and ‘verbs in the present’ (a reduction function applied on the chain covering the grammatical class ‘verb’ and the grammatical form ‘tense present’).
<OPIMOD-4> ::= (<ATOT> (<OPEDUO> (<ATOT>, (<AtFun>))))
      At this level the triple track should preferably be displayed in the text sounding board. The triple track involves a complex mix of recursively activated functions constraining the display of interconnected panes in the triple track. The modus operandi also provides for all the navigational operators. The underlying files include highly specialised information, among others the words' grammatical functions and relative position within sentences, together with different kinds of density measures, etc. This implies that the underlying intersecting chains should preferably be optimised, calling for a particular DBP (DBP Triple Track).
<OPIMOD-5> ::= (<AtFun>) | (<ATOT> (<OPEDUO> (<ATOT>, (<ELEMENT>))))
      See the description of ELEMENT (enclosing the advanced operators) under the section briefly describing the component ‘values’ and the section Zonation Criteria. All constructions (groups of functions) should be reachable, and HCI factors will preferably guide the transitions between the other modus operandi and these most advanced facilities. An orderly application of the groups of constructions will preferably provide for virtuous exploration and navigation facilities. This includes options for the display of zones according to the ‘authority norm space model’ as inscribed in the present invention's underlying document class model. The interchanging reflections between text zones and the triple track are preferably also included. That is, when the user navigates or traverses text zones, the content of the text zones is mirrored in the triple track, giving a glance of some word constellations that are connected via underlying grammatical patterns, which are captured in a previous pre-processing and direct the moves in the tracks. By a text driven reflection of features in the inner context, the user is given support for the coupling between insight, chance and discovery - the three princes of Serendip as mentioned above.

Conceptual Outline of Filtering Options

The users' intentions are at all times dependent on the situational context; see section ‘The principles of text driven attention structures’. The following tables give a more detailed conceptual outline of how the building blocks and functions are preferably combined into filtering options. The descriptions include the elements given in Table 12. Only a subset of the filtering options is explained in this form, but seen together with the section ‘Zonation Criteria’ they convey the flexibility obtained by the present invention's selectivity.

The building blocks are interlinked as defined in a formal grammar; the section ‘Overview of functions and groups of constructions’ gives an outline of this grammar.

TABLE 12

ID: An identifier used to refer to the filtering option.
Name: A name describing its main functionality.
Definition: A definition including the main search operands and operators. The search operand types are described in the section ‘Conceptualisation of Building Blocks in Search Macros’.
Principle: A description of the underlying classification principle and how building blocks may be combined in a predefined set of search macros.
Prerequisite: System requirements.
Expected result: Increase or reduce the retrieved set. Each building block has a conceivable counterpart.
Example: Several examples are described in the section ‘Zonation Criteria’.
Derived from: The types of underlying annotations (grammatical, semantic and/or pragmatic) and structural information (text segments encoded according to an XML-scheme), and other meta-data held in the DC Element Set.
Intention: The user intention preferably supported.

Reduction Filters

The main intention of the following filtering options is to reduce the retrieved set of zones and/or contacts.

Limit Contacts to Occur in a Specific Document Type

Table 13 outlines how to limit contacts to occur in a specific document type according to a wide set of criteria.

TABLE 13

ID: 13.09.01-1
Name: Limit contacts to occur in a specific document type.
Definition: Restricting search scope to certain Document Types.
Principle: The APOS are filtered against document types, that is, each document is assigned to a document class according to the present invention's preferred document class model. The four broad document classes (each class with subclasses) as defined in the document class model will preferably be used to restrict the search span. Most retrieval systems offer the option of restricting search spans by selecting database partitions. The classification criteria underlying the four main document classes are, however, different with respect to features referring to the document's situational context. The classification criteria are based on the social relations between Sender and Receiver, and the Sender's authority. The document classes will support the definition of hypertextual links between zones (or other text units) extracted from different documents. For instance, units extracted from Debate documents (newspaper) may in some way be related to utterances in the Negotiation documents (discussions in ministries) and further on in law proposals. The units are to be connected in a hypertext system, that is, predefined links between selected (extracted) text units (the links are considered as conceptual pathways through the text base). Meta data about documents are search operands at document level (document form descriptors). The retrieved set will be further reduced if the search is directed towards a limited set of documents within a document class. Document Class filters may be used in combination with Documental Logical Object Type filters or filters based on the DC element set (Time for document production, Author, etc.). Users can create their own ‘virtual data bases’ by activating a stored search macro selecting documents from the source (global) database. The activation of a stored search macro ensures that the ‘virtual data base’ reflects changes in the source database.
Prerequisite: Document types are classified according to the document class model reflected in the extended Dublin Core Element Set (DCES).
    Restrict 1: Limit to documents published in a particular period of time. The time element is an important part of the request. The collection must support time period partitioning.
    Restrict 2: Limit to documents belonging to a certain class. Specific criteria for documental context, selected sources, etc. are an important part of the request.
    Restrict 3: Limit to documents that have, or do not have, a certain word in their titles (main headings).
    Restrict 4: Limit to documents with certain keyness values, or keyness of keyness assigned in the new ‘Keyness Element’ (subordinate to the fixed element Keyword in DCES).
    The system must give the user options for selecting document classes and support the selection of contacts in combination with these. Document Class is a filter option.
Expected result: The user has selected an APO Triplet as search operand, or selected a subset of an APO Triplet (for example [Nurse as Agent]), or a bundle of subsets of the APO Triplets (for example [Nurse as Agent OR Object] OR [Nurse OR Doctor as Agent], etc.). This is the current subset of APO Triplets. The user decides to restrict the search span to a certain Document Class and according to the Sender's authority.
Example: (Current subset of APO Triplets) WITHIN (Document Class = Law)
Derived from: The result from document classification is embodied in a Dublin Core Element Set consolidated in DBP Information Document Class, and the document structure is consolidated in DBP Information Document Structure.
Intention: The intention is mainly to reduce the retrieved set and improve precision. Precision is improved if a particular document, document class or document genre is required.

Limit Contacts to Occur in Predetermined Textual Units

Table 14 outlines how to limit contacts to occur in predetermined textual units.

TABLE 14

ID: 13.09.02-2
Name: Limit contacts to occur in predetermined textual units, or a combination of units. The textual units are according to the set of Documental Logical Object Types (DLOT).
Definition: Restricting search scope to certain Documental Logical Object Types, either structural object types such as chapter, section, heading, sentence, and/or derived object types reflecting thematic issues, i.e. text zones.
Principle: The APO Triplets are filtered against DLOTs as encoded according to an XML-schema, i.e. well-formed documents, in DBP Document Structure. Alternatively, the APO Triplets are filtered against text zones as specified in the DBP Information Zone. XML gives a richer opportunity for combining and linking different DLOTs.
Prerequisite: The system must give the user options for selecting documental object types and support the selection of contacts in combination with these. APOS are generated for all titles and headers (classified and tagged as special types of sentences). The documents have to be segmented and annotated (in XML). The device that identifies text zones according to the zonation criteria has processed texts extracted from documents. The set of search operand types for DLOTs is described in the section ‘Conceptualisation of Building Blocks in Search Macros’. The user should be given options for restricting ‘Search Scope’ (an option displayed in the main search pane).
Expected result: Users' moves restrict contacts so that they occur in certain DLOTs. This retrieves a set of units for which the concept (contact) is important enough to be mentioned in their titles and, further, in which the word or words also occur in the nearest subsequent logical object types.
Example: ((Nurse WITHIN DLOT Title) FOLLOWED-BY ((A = Nurse OR O = Nurse) WITHIN (DLOT Sentence D:2 echo Title)))
    The contact Nurse is in a title (heading) and the word nurse occurs in either the A or the O position in the first 2 sentences (D:2) following the same (echo) title (A and O in the APO Triplet). If a contact like <nurse> is combined with the object type <title>, the user may expect that this section of the document has nursing as one of its central themes. If the contact <nurse> is to be found in the A or O position in the following first or second sentence (or zones), the weight is higher (depending on frequency and distribution with respect to the A and O positions).
Derived from: Extracted APOS from tagged logical object types. Index structures for specific logical object types and organised as APOS are derived from underlying tagged logical object types. Type: Documental Logical Object Type (DLOT).
Intention: The intention is mainly to reduce the retrieved set of units or contacts and improve precision. Precision is improved under the assumption that words in titles signify themes dealt with in underlying sentences/zones. If the word or a set of words occurs as tagged with the grammatical functions subject or object in the following nearest sentences or zones (or paragraphs), restricted by a distance operator, the assumption is further that this signals more weight given to the word or words.

Limit Contacts on Proximity Criteria

Table 15 outlines how to limit contacts on proximity criteria.
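As a non-limiting illustration of the proximity criterion set out in Table 15, the following sketch collects the contacts found in sentences lying within a given distance before or after a selected sentence. It assumes that sentence identifiers are serial numbers and that contacts are available per sentence in a simple mapping; the function name `contacts_near` and the sample data are hypothetical.

```python
def contacts_near(anchor_id, contacts_by_sentence, distance=20):
    """List contacts from sentences within `distance` of the anchor sentence.

    Each hit is (sentence id, relation, contact), where the relation is
    PRECEDES or FOLLOWS relative to the anchor. The anchor itself is skipped.
    """
    hits = []
    for sentence_id in sorted(contacts_by_sentence):
        offset = sentence_id - anchor_id
        if offset != 0 and abs(offset) <= distance:
            relation = "PRECEDES" if offset < 0 else "FOLLOWS"
            for contact in contacts_by_sentence[sentence_id]:
                hits.append((sentence_id, relation, contact))
    return hits


# Hypothetical contacts keyed by serial sentence number.
sample = {3: ["nurse"], 10: ["doctor"], 25: ["patient"], 40: ["ward"]}
print(contacts_near(10, sample))
# [(3, 'PRECEDES', 'nurse'), (25, 'FOLLOWS', 'patient')]
```

The default of 20 sentences mirrors the example distance used in the table and stands in for a default operand that the user may alter.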

TABLE 15

ID: 13.09.02-3
Name: Limit contacts on proximity criteria.
Definition: Contacts in APOS occur closer to one another in the searched text.
Principle: When an element within an APO Triplet is selected, the user can activate an option for displaying contacts within a certain distance of the selected set (or part of the set), for instance displaying contacts within a distance of 20 sentences to the left and right of the selected APO Triplet. This option has some similarity with KWIC (Key Word In Context) but differs in that the display is reduced to sets of grammar-based codes (contacts) in accordance with underlying grammar patterns (realised as regular expressions combined in search macros).
Prerequisite: An APO Triplet (a specific constellation of grammar based contacts) refers to one or several sentences. Each sentence has an assigned ID (Type: DLOT Identifier, a serial number or a relative position in the text file). A proximity operator will display contacts derived from the 20 preceding sentences or the 20 subsequent sentences with reference to the sentence referred to by an APO contact or by all contacts in an APO Triplet. If the APO Triplet (or a contact within a triplet) has a frequency higher than 1, the software component must stack the sets for display (Selection). A ranking procedure will assign weight to the stack members, for instance by giving higher priority to stack members in which a word in the Agent position also occurs in the Object position. The system must support proximity operators and operators like ‘within’ or ‘enclosed by’, ‘followed-by’, ‘precedes’, etc. See section ‘Operators’.
Expected result: The APO Triplets as preferably embodied in the triple track will give the user an immediate glance of the current APO Triplet's nearest inner context. The inner context that is revealed is reduced in accordance with the underlying grammar patterns, and the contacts are highlighted in the logical object types specified in the search macros (the distance operator may preferably have a default value that may be altered by the user, i.e. the default value is replaced by an open operand).
Example: Set Currency Indicator at (SELECTED APO) giving (LOT Sentence with LOT Identifier)
    (APO PRECEDES (DLOT Identifier D:20 echo Currency Indicator)) OR (APO FOLLOWS (DLOT Identifier D:20 echo Currency Indicator))
Derived from: Triplet APO <is input to> Device Target Word Selection (TWS) <is realized as> Triple Track Device Triplet Generator <produces>
    and is connected to the underlying text via the SVO Triplet processed by the triplet generator:
    Device Triplet Generator <is member of> Set Device <produces> Triplet APO cat5 fac0 Subject Matter <is input to> Criteria Grammar Based <gives rules for> Set Database Partition (DBP) <is input to> Triplet SVO <is input to>
    where the SVO Triplet:
    Triplet SVO <is derived from> DBP Information Word <is input to> Device Target Word Selection (TWS) <is input to> Device Triplet Generator SVO Entry Noun <is part of> SVO Entry Verb <is part of>
    Intersecting facets according to one of the categories specified in the classification scheme can preferably also restrict the contacts in the APO Triplet.
    cat5 fac0 Subject Matter <is abstracted into> cat4 fac0 Subject Matter Complex <is input to> Device Triplet Generator <is a> Type: Category cat5 fac1 Agent <is subordinate to> cat5 fac2 Process <is subordinate to> cat5 fac3 Object <is subordinate to>
Intention: Reduce the retrieved set (sentences within a certain distance), possibly always in combination with a search scope restricted to certain document types. Improve precision regarding the textual context for the contacts in the Subject or Object pane. Proximity or distance operators will impose the restriction that the contacts in the APOS must appear within a scope as specified in the distance operator. Zones (defined by zone borders) will preferably be a default operand for proximity, where the distance operator regulates the distance between zones.

Limit the Contacts Based on Frequency Information Combined with Grammatical Information.

Table 16 outlines how to limit the contacts based on frequency information combined with grammatical information.
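Table 16 distinguishes four types of frequency information for a word occurring with a given Grammatical Function (GF): the number of documents, zones and sentences in which it occurs, and its total frequency. A minimal sketch of how these four counts might be computed is given here. The `Occurrence` record, its field names and the sample data are assumptions made for illustration; zone and sentence identifiers are assumed to be unique only within a document.

```python
from collections import namedtuple

# Hypothetical record of one word occurrence with its grammatical function.
Occurrence = namedtuple("Occurrence", "doc zone sentence word gf")


def frequency_profile(occurrences, word, gf):
    """Return the four frequency types for a word with a grammatical function."""
    hits = [o for o in occurrences if o.word == word and o.gf == gf]
    return {
        "documents": len({o.doc for o in hits}),
        "zones": len({(o.doc, o.zone) for o in hits}),
        "sentences": len({(o.doc, o.sentence) for o in hits}),
        "total": len(hits),
    }


occurrences = [
    Occurrence("d1", "z1", 1, "nurse", "A"),
    Occurrence("d1", "z1", 1, "nurse", "A"),  # twice in the same sentence
    Occurrence("d1", "z2", 5, "nurse", "A"),
    Occurrence("d2", "z1", 2, "nurse", "O"),  # Object position, not counted for A
    Occurrence("d2", "z1", 3, "nurse", "A"),
]
print(frequency_profile(occurrences, "nurse", "A"))
# {'documents': 2, 'zones': 3, 'sentences': 3, 'total': 4}
```

The same counting could be applied with the Grammatical Word Class in place of the Grammatical Function.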

TABLE 16

ID: 13.09.01-4
Name: Limit the contacts based on frequency information combined with grammatical information.
Definition: Words (preferably nouns) having a frequency above a certain threshold value (determined intratextually or intertextually) or a high keyness value are represented as either focused words or major descriptors respectively (nouns contained in keyword filters). Focused words are elaborated further in the section ‘Zonation Criteria’.
Principle: Each document entering the document collection is processed in order to compute keywords (words with a high keyness value). See section ‘Apparatus for acquisition’. The lists of candidate keywords are processed by a POS-tagger in order to identify nouns and preferably word lemma. Keywords that are nouns and their keyness value are stored as an ordered set in the DCES. All keywords derived from documents related to a certain domain (for instance Petroleum, EU enlargement, etc.) are stored in separate keyword filters, i.e. chains connecting keywords according to pragmatic criteria. By activating one of the keyword filters, the lists in the Agent or Object panes are reduced accordingly. Only nouns occurring in the selected keyword chain are displayed in the panes. Alternatively, nouns affected by the filter are highlighted in the panes (colour changes when activating the keyword filter). Domain-specific keyword filters may of course overlap. The overlap is managed by the structure defined for the classification scheme, in which a specific word type can be a member of more than one facet. Keyword chains are also utilised in the target word selection procedure aiming at constructing evolving domain specific thesauri. The frequency and keyness information is stored and managed in DBP Information Frequency and DBP Information Keyness. Contacts with a high frequency (or a threshold frequency value) may be of importance to the user. If the selected search scope is intratextual, the default display of contacts in the APOS is preferably by appearance or by consolidated frequency, preferably in A. If the selected search scope is intertextual, the display should preferably be restricted to zones (the user can at any point change options). The system also supports an option for display based on frequency. For instance, for each contact in the A position the user may choose to order them by frequency. This type of frequency information is assigned according to the word's Grammatical Function (GF). (The same types of frequency information may also be computed for the word's Grammatical Word Class (GWC).) There are basically four types of frequency information:
    number of documents in which the word occurs with a GF (or GWC)
    number of zones in which the word occurs with a GF (or GWC)
    number of sentences in which the word occurs with a GF (or GWC)
    total frequency for the word occurring with a GF (or GWC)
    This type of frequency information can be used to filter the text with respect to all text units.
Prerequisite: Documents are classified with respect to criteria for domain (thematic field).
    DCE Domain Indicator <is part of> Type: Dublin Core Element Set (DCES),
    and the Domain Indicator(s) is/are either derived from bibliographic information or assigned to the document based on the accumulated results from a Target Word Selection Procedure. When the system has generated a set of keyword filters, it will also be possible to automatically (or at least semi-automatically) classify documents (assign a code for thematic field, that is, a Domain Indicator) based on the result from the keyness calculations. Keyness calculations require an appropriate Reference Corpus:
    Device Keyness Calculation <produces> DBP Information Keyness <is part of> Device Frequency/Grammatical Distribution Calculation <is part of> Device Support Document Analysis <is member of> Set Device DBP Information Reference Corpus <is input to>
    Output from keyness calculations (a list of words with total and relative frequency and a keyness value) is input to a POS-tagger (a threshold value for keyness determines the number of words to be included as input). The information is used during the classification of documents.
    Device Support Document Analysis <is part of> Apparatus Acquisition <produces> Document Candidates Analysed <is member of> Set Device Device Keyness Calculation <is part of> Document Candidates <is input to>
    Only nouns are selected from the output (from the POS-tagger), stored as separate ordered sets or merged with existing words in keyword chains referring to documents with shared features with respect to the situational context.
Expected result: Sentences from documents with nouns that are ‘unusually frequent’ as compared to the total set of words in a reference corpus are selected for retrieval. Ranking is based on keyness or total/relative frequency. Display sentences from documents having the specific keywords assigned in a DC element and display sentences with the nouns in the A or O position. The nouns may also be displayed without the restriction that they are tagged as subject or object.
Example: ((A OR O) IN (DCE Keywords) WHERE (Keyness ARIT-OPERATOR Value)). Keyness information will preferably not be made directly available to users, but used in intermediate filtering options.
Derived from: Automatically constructed for each document entering the database.
Intention: Reduce the retrieved set and improve precision (assign priority through calculated keywords).

Limit the Contacts Based on Frequency Information and the Intersection of Chains.

Table 17 outlines how to limit the contacts based on frequency information and the intersection of chains.
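The filtering performed by a Zone Sensor as outlined in Table 17 may be illustrated, in a non-limiting way, by counting the nouns of one zone, optionally intersecting them with a chain, and keeping those whose frequency exceeds a threshold. The function name `zone_sensor`, the representation of a chain as a set of words, the strict ‘above threshold’ comparison and the alphabetical tie-break are all assumptions of this sketch.

```python
from collections import Counter


def zone_sensor(zone_nouns, chain=None, threshold=1):
    """Return (noun, frequency) pairs for one zone, most frequent first.

    zone_nouns: noun occurrences extracted from the zone.
    chain: optional set of words; when given, only nouns in the chain are kept.
    threshold: only nouns with a frequency above this value are returned.
    """
    counts = Counter(zone_nouns)
    kept = [
        (noun, freq)
        for noun, freq in counts.items()
        if freq > threshold and (chain is None or noun in chain)
    ]
    # Order by decreasing frequency; ties are broken alphabetically (assumption).
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical noun occurrences within one zone.
nouns = ["Gassforvaltning"] * 4 + ["Konkurransekraft"] * 3 + ["Regjering"] * 2 + ["Risiko"]
print(zone_sensor(nouns))
# [('Gassforvaltning', 4), ('Konkurransekraft', 3), ('Regjering', 2)]
print(zone_sensor(nouns, chain={"Gassforvaltning", "Regjering"}))
# [('Gassforvaltning', 4), ('Regjering', 2)]
```

Ordering by first appearance or by semantic class, which the table also mentions, would only require a different sort key.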

TABLE 17
ID: 13.09.02-8
Name: Limit the contacts by intersecting chains.
Definition: Selected contacts in the Agent and/or Object panes are restricted according to four types of frequency information calculated for DLOT and combined with the intersection of chains. A chain is a linked list of words according to grammatical, semantic and/or pragmatic criteria.
Principle: A Zone Sensor denotes a wide set of filter options that extract nouns (or other word classes covered by chains) and intersect the chains with information about DLOTs, either sentences or preferably zones. Information about the DLOTs is registered in DBP Information Sentence, DBP Information Zone, and DBP Information Word. The zone sensor arranges the nouns in order of frequency (or in order of first appearance or alphabetically), and/or in semantic classes (based on pre-specified criteria), and transmits the result for display in the text sounding board. The results are stored in an intermediate file, ‘Zone Sensor’, managed in the Top Layer of the MAFS. The Zone Sensor can accordingly operate on all feature sets registered and consolidated in the interconnected DBPs. Device Zone Sensor <is part of> Device Zone Bond Generation <produces> Zone Sensor DBP Information Chain <is input to> DBP Information Zone <is input to> Zone Sensor <is consolidated in> DBP Information Zone Bond <gives rules for> Text Sounding Board Content
Prerequisite: The system must include frequency information assigned to each contact in the APO Triplets so that the user has the option of specifying a minimum frequency (preferably at zone level).
Expected result: The intersection of chains, combined with frequency information, can filter the content displayed in the text sounding board along all dimensions defined for the present invention's selectivity.
Example: The Device Zone Identification has located a text zone comprising 11 sentences in which the noun ‘Hensikt’ (Purpose) occurs (repetition of the word Hensikt in adjacent sentences, maximum distance for adjacency = 1). The word ‘Hensikt’ is also considered to be a cue phrase signalling a discourse element (e.g., the purpose of this discussion is . . .). The Device Zone Sensor extracts all the nouns from the Zone Link Set for this zone, selects the nouns also being repeated and collocated with Hensikt, computes the frequency of the different nouns, and transmits nouns with a frequency above a certain threshold value (e.g., 2) to the text sounding board, these values will coincide Sounding Board. When the user activates zones according to criteria in the class (Discourse Elements), the text sounding board may guide her attention to the fact that this particular text has, for example, 4 zones identified according to the word Hensikt or one of its near-synonyms (e.g., ‘Formål’, ‘Siktemål’). The user may then activate the displayed contact Hensikt and ask for the display of more detailed information about each zone. This user request activates the Device Zone Sensor, and the words (nouns) in the left column show the nouns occurring more than twice within Zone ‘Hensikt’, arranged in order of frequency. Semantic criteria provide for an alternative arrangement or order of display.
Derived from (first column): The nouns to the right are filtered according to frequency, and the words as displayed in the content panes are intersected with chains based on semantic criteria. (Subset of Nouns <within> current zone), order is frequency.
Derived from (second column): All nouns extracted from the chain that intersects the current zone in alphabetical order. The nouns are in either Subject or Object position (two other chains). (Nouns <within> current zone)
Chain intersections (first column): Gas Management (4); Competitive Power; Continental Shelf; Value (2); Challenge (2); Government (2). Indicating the thematic sensemes or linguistic signs of ‘classes of meanings’: The first and third nouns are classified as focus words within the theme of ‘oil affairs’. The second and fourth nouns are classified as focus words within the theme of ‘economical affairs’. The fifth noun belongs to the set of cue phrases (discourse elements) signalling ‘something’ related to problems. The sixth noun refers to an important actor (government).
Chain intersections (second column): Aktør (Actor) (1); Effektivitet (Efficiency) (1); Eierskap (Ownership) (1); Gassforvaltning (Gas Management) (4); Konkurransekraft (Competitive Power) (3); Kontinentalsokkel (Continental Shelf) (2); Perspektiv (Perspective) (1); Regjering (Government) (2); Ressurs (Resource) (1); Restrukturering (Restructuring) (1); Risiko (Risk) (1); Sammenheng (Connection) (1); Siktemål (Objective) (1); Stat (State) (1); Utfordring (Challenge) (2); Verdi (Value) (1); Vurdering (Assessment) (1).
Intention: When displayed in the triple track, the user will preferably get an immediate impression about the theme in the zone. If the same words occurred in a zone marked with an indicator for the discourse element ‘Problem’, this would preferably influence the user's interpretation of the contacts reduced by combining frequency and the intersection of chains.

Cut Display

Table 18 outlines how to cut display at random.

TABLE 18
ID: 13.09.02-23
Name: Cut display
Definition: Submit only part of the retrieved contacts, selected at random. General filter to be applied on top of other filters.
Principle: Precision-oriented moves have failed and the retrieved set of contacts is still too large (user's futility point exceeded). It is assumed to be acceptable for the user to see a few contacts only (in order to get informed about the textual content and, based on this, to enable the user to select more discriminating contacts from APOS).
Prerequisite: Simple randomising function activated on chains providing input to the APO triplets.
Expected result: The user can experiment with various minimum and maximum values, which may be combined with one or several of the other filter options.
Example: APO Triplets wanted (= maximum Number) AT RANDOM (Minimum Value IN Maximum Value). APO Triplets (max = 100) AT RANDOM (1 IN 5) will display each fifth APO Triplet of the 100 first triplets generated and temporarily held in a stack. The system proposes default values for minimum and maximum based on the total number of APO Triplets referencing the current document set. The default value can preferably be based on information derived from the Zone Link Set.
Derived from: DBP Information Frequency <is input to> Device Zone Density Calculation <is a> Set Database Partition (DBP) DBP Information Keyness <is consolidated in> DBP Information Sentence Consolidated <is consolidated in> DBP Information Word <is consolidated in>
Intention: Reduce set of displayed APO Triplets at random.

Negate APO Triplets and Store as User Profiles

Table 19 outlines how to negate APO Triplets (or contacts within) and store as user profiles.

TABLE 19
ID: 13.09.01-16
Name: Negate APO Triplets or contacts within APO Triplets and store as user profiles
Definition: Negating certain contacts will reduce APOS, and the reduced APOS can be stored as user-defined filters that may be activated in future searches.
Principle: The user has some experience in using the APOS and moving between/selecting contacts. One of the filter options may for instance be activated, and the user identifies unwanted contacts (that is, the contacts are evaluated as not having the wanted discriminating effect or as being of no interest to the user). Such contacts may be eliminated from future displays of APOS by giving the user an opportunity to negate these specific contacts and store the negated words in a kind of stop-list, that is, a user-defined filter.
Prerequisite: Multileveled Annotation File System (MAFS) in which user-specific profiles are stored and maintained in the Top Layer (further consolidated into DBP Information User Profile, see section ‘User Profile’).
Expected result: The negated contacts are not deleted from the Top Layer, but marked as not to be included in future searches when the specified user (User Identifier) enters the system. Thus, the user can at any time reset the specific contacts to be restored in future searches. The user has an option for defining user views for all documents in a collection or for subsets or in accordance with certain tasks (the user can specify several user views over the same set of grammar based contacts).
Example: Set Currency Indicator at (SELECTED APO) Set Flag IF (((NOT (A)) OR (NOT (P)) OR (NOT (O))) OR ((NOT (A AND P AND O))) INCLUDE Flag in ((APO Triplet) WITHIN User Filter))
Derived from: User demands and requirements activated for all components underlying the available options for filter and display.
Intention: Improve precision based on experience.

Intersect a Contact with a Free-Text Term

Table 20 shows how to intersect a contact with a free-text term.

TABLE 20
ID: 13.09.01-7
Name: Intersect a contact with a free-text term
Definition: Selected contacts in the Agent and/or Object panes are intersected with free-text index terms.
Principle: Free-text index terms are restricted to words within sentences that are not extracted to be items in the APOS. Other possible restrictions are nouns and verbs only (lemma or base form) and with a predefined minimum relative frequency within a document. Adverbs and adjectives may be important descriptors in some requests, but these will be incorporated in predefined grammar based search macros (combined regular expressions). For instance, (Adjective Comparative PRECEDES D: 3 Nouns All) lists APOS where the sentences also contain an ‘adjective comparative’ preceding a noun within the distance of 3 words. An APO constellation refers to one or several sentences. The sentences are members of higher order documental object types. The free-text terms must preferably refer to higher order object types such as zones or sections, or whole documents.
Prerequisite: The system must allow for the combination of free-text terms and grammar based contacts. The selection/specification of free-text terms must be available in a separate window. Free-text terms can also be generated from the user profile (user-added text may be considered as a set of free-text terms, and disambiguated by grammar taggers). The user has an option for selecting one or several terms from a list, or the user can specify one or several terms in a traditional free-text query. The system allows for combining the user-defined search expression with search operands selected from the APOS. The user has the possibility to remove both free-text terms and fixed search operands (contacts in the APO Triplets) from the query.
Expected result: If a free-text term like <paediatrics> is combined with the contact <nurse> in either the A or O position, the segment may contain information about nursing related to paediatrics.
Example: Set Currency Indicator at (SELECTED APO) WHERE (A = Nurse OR O = Nurse) giving (DLOT Sentence with DLOT Identifier) ((DLOT Identifier WITHIN DLOT Zone) WHERE (DLOT Word = Paediatrics))
Derived from: A free-text index includes all words not included in the APOS, or a free-text index includes all words (excluding words in stop-lists). A and O items may for instance be reserved for nouns only; however, there are a lot of nouns not having the syntactical functions of Subject or Object in the sentences.
Intention: Improve precision regarding the textual context for the selected nouns in the Agent or Object pane.

Expansion Filters

The main intention of the following building blocks is to increase the retrieved set of text units or contacts. The expansion filters are more or less counterparts to the Reduction Filters.

Filter: Incremental Aboutness

TABLE 21
ID: 20.09.02-34
Name: Incremental aboutness
Definition: Expanding search scope incrementally.
Principle: Based on document structure (structure specified according to an XML-scheme) or zones founded on grammatical, semantic and/or pragmatic criteria. The filter options generate APO Triplets in a stepwise, incremental manner:
-   APO Triplets for titles and section headings (following the hierarchical structure).
-   APO Triplets for first sentences (1–2) in first paragraph following section headings.
-   APO Triplets for the first sentences (1–2) in each paragraph.
-   APO Triplets for all sentences.
-   APO Triplets for first sentence in each zone, or first sentence in zone with highest weight or density first and then outwards (to zones with lower values for weight and density).
-   APO Triplets for intersecting zones.
-   APO Triplets for all sentences in all zones.
When using this incremental filter option (interlinked set of search macros), the user must have available options for activating/deactivating other filter options at each level/step.
Prerequisite: Information about DLOTs and Distance Operators.
Expected result: Advanced exploratory facility in which the user can scan/browse the APO Triplets in a top-down manner or inside out based on values for weight and density assigned to sentences and zones.
Example:
1 Set scan on DLOT Title AND DLOT Header
  Loop until end-of-scan
  SELECT APO WITHIN (DLOT Title OR DLOT Header)
2 Set scan on DLOT Header
  Loop until end-of-scan
  Set Currency Indicator at DLOT Header with DLOT Identifier
  SELECT APO WITHIN DLOT Sentence WHERE (DLOT Identifier D:2 echo Currency Indicator)
3 Set scan on DLOT Zone
  Loop until end-of-scan
  Set Currency Indicator at DLOT Zone with DLOT Identifier
  SELECT APO WITHIN DLOT Sentence WHERE (DLOT Identifier D:2 echo Currency Indicator)
Derived from: The information about sentence locations is derived via one of the two DBPs. Zone Border includes information about the first and last sentence of zones, and the zone density makes it possible to traverse the zones inside-out (from the zones with the highest density first and then outwards). DBP Information Document Structure <is input to> Device Text Extraction <is a> Set Database Partition (DBP) Device Document Structure Identification <produces> DLOT Document Logical Object Type <is consolidated in> MAFS Segmentation Information (ATF) <is consolidated in> DBP Information Zone <is input to> Device Zone Bond Generation <is input to> Device Zone Density Calculation <is input to> Device Zone Sensor <is input to> Device Zone Weight Calculation <is a> Set Database Partition (DBP) DLOT Zone <is an object in> Zone Border <is derived from> Zone Density <is consolidated in> Zone Link Set <is consolidated in> Zone Weight <is consolidated in>
Intention: Stepwise browsing of all available APO Triplets where the user can select an APO Triplet for further exploration at any level. The user can also activate other filter options in combination with the option for ‘incremental aboutness’. A selected (activated and current) APO Triplet can further be input to a search macro locating identical APO Triplets or APO Triplets with contacts partly overlapping with the current APO Triplet.

Expand or Limit Contacts by Fan Structures

Table 22 outlines how contacts are limited or expanded by “fan structures”.

TABLE 22
ID: 13.09.02-8
Name: Limit or expand contacts by fan structures.
Definition: Expand or restrict a focused word that is linked to a fan structure. The concepts ‘focused word’ and ‘fan structure’ are described in the section ‘Zonation Criteria’. search operand by selecting a semantic code linked to a contact via more specific levels in a thesaurus
Principle: A contact displayed in the APO Triplets may be replaced by more specific contacts unfolding in separate panes to the left or right of each contact displayed in any of the tracks. Grammar taggers known in the prior art, for languages with concatenations as a characteristic feature, are steadily improved by including tags marking concatenations. Algorithms for splitting concatenated words into their constituent parts are known in the prior art. The present invention combines this knowledge with proximity criteria and insight about how the author's language use fluctuates from general words to more specific words, seen as patterns with respect to distance. Fan structures are of particular value as a tool applied in the device that calculates scores for connection points between sentences. If a certain contact, assigned to be a member of a chain for focused words, is registered as having attached a fan structure, the preferred option is that the display should include an icon with the symbol of a fan. When the user activates this button, preferably two interconnected panes will unfold and display more details as shown in the example below.
Prerequisite: Grammar taggers must handle concatenated words and tag them as such. Grammatical information is combined with frequency information in order to divide the fan structures according to the members' total frequency (intratextually and intertextually).
Derived from: For concatenations that have nouns as their constituents, the present invention embodies a device that generates fan structures superimposed on word sets organised along the dimension of general-specific. The device splits the fan structure into frequency classes, and constructs links between words if they are related by lexical similarity between the components in concatenated words. The words classified as unifying language use are placed in the centre of link sets (forming unfolding fans in the text sounding board when displayed from the centre and then left and right). These are typically (in Norwegian) low or middle length words, and the selection criteria are that the tagger has not classified them as concatenated and that they have a frequency above a certain threshold value determined intratextually. If the constituents of concatenated words are similar to words in the set of unifying words (centre words), the constituents are denoted as convergence words and linked to word types that are equal to the constituents. The link type is either <is a> or <aspect of>, depending on whether the constituents are the first part or last part of the concatenated word. Fan structures are generated intratextually, and subsets of encoded fan structures can be transferred to cover new texts if both sides (word types) in the fan structures are registered as occurring in the new text.
Example: The focused word in the centre and Word Convergence unfolding on both sides.
Focused Word: SELSKAP (Company)
<aspect of>: Selskapsanalyse, Selskapsavtale, Selskapsdeltakelse, Selskapsnivå, Selskapsorgan, Selskapsside, Selskapsskatt, Selskapsstrategi, Selskapsstruktur, Selskapsverdi, Selskapsavgift, Selskapsstruktur, Selskapsdannelse, Selskapsfusjon, Selskapsledelse, Selskapsregime, Selskapsbalanse, Selskapsregulativ, Selskapsdrift, Selskapsform
<is a>: Oljeselskap, Forvalterselskap, Transportselskap, Statsaksjeselskap, Gasselskap, Aksjeselskap, Morselskap, Allmennaksjeselskap, Energiselskap, Enkeltselskap, Oppstrømsselskap, Statsaksjeselskap, Statsselskap, Eierselskap, Elektrisitetsselskap, Nøkkelselskap, Operatørselskap, Produsentselskap, Sdøe-ivaretakerselskap, Serviceselskap
Intention: The users wish to explore from the general to the specific level, and the fan structures support this in a flexible manner. When the user traverses the zones by following a chain based on focused words, the unfolding panes can dynamically reflect the specialisations of the focused word as the zones are traversed.

Definition of Some Important Terms Used in this Specification

Textual Contacts, or Simply Contacts

The index entries represented in the APO Triplets, which are a part of a higher order representational form (‘Topic Frames’), are terms extracted from the underlying grammatically annotated text base. Each word in the multileveled annotated file system is assigned an identifier (the document ID + the word's relative position within the file), and thereby it is possible to directly access the word or word constellation from which the index entry is derived. Since the index entries are connected to the underlying text by this mechanism, they are denoted as contacts in the sense that they are contact points to the underlying text. Through these connections the user may visit and explore the text segments and select or discard the displayed segments.
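The identifier mechanism can be illustrated with a minimal sketch. It assumes that a document is held as a list of words and that the relative position is a zero-based word offset; the names `Contact`, `build_contacts` and `show_context` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    """An index entry that keeps a pointer back to its source word."""
    term: str      # the index entry as displayed to the user
    doc_id: str    # document ID
    position: int  # relative word position in the file (assumed zero-based)


def build_contacts(doc_id, words, wanted):
    """Create a contact for every word whose lower-cased form is in `wanted`."""
    return [Contact(w.lower(), doc_id, i)
            for i, w in enumerate(words) if w.lower() in wanted]


def show_context(contact, corpus, window=2):
    """Open the underlying text segment around the word a contact points to."""
    words = corpus[contact.doc_id]
    start = max(0, contact.position - window)
    return " ".join(words[start:contact.position + window + 1])


# Illustrative one-document corpus (hypothetical data).
corpus = {"doc1": "Hydro is an oil company on the continental shelf".split()}
contacts = build_contacts("doc1", corpus["doc1"], {"hydro", "company"})
segment = show_context(contacts[1], corpus)
```

Because each contact carries the pair (document ID, position), the display layer can move from an index entry to the text segment without searching the text again.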

Triplets of Contacts

The intended basic visualisation of contacts in window panes is tentatively designed as a combination of three index entries referring respectively to Agent, Process and Object. Each triplet contains collocating contact points to underlying text segments, collocating in the sense that they represent collocating words in the underlying text. The triplet structure is a manifestation of three basic facets in the classificatory meta-structure following principles adapted from the idea behind ‘free faceted classification’, originally put forward by Ranganathan. However, a set of grammar based extraction patterns is the superordinate principle underlying the actual extraction process. According to the principles underlying the free faceted classification norms, each facet may be further organised in rounds and levels. Each round has several levels: levels with more detailed grammatical information and levels with semantic information (abstraction levels). The highest level in each round is a set of predefined search macros, and the components in a search macro are regular expressions used for extracting words/word constellations from the text, which are further transformed to the representational form as prescribed for the basic triplet structure.
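A search macro composed of regular expressions might look like the following sketch. The annotation format `word/POS:FUNCTION` and the tag names `N:SUBJ`, `V:MAIN` and `N:OBJ` are assumptions made for illustration only; the actual tag set of the annotated text files is not specified here.

```python
import re

# Hypothetical annotation format: word/POS:FUNCTION, tokens separated by spaces.
SUBJECT = r"(?P<agent>\w+)/N:SUBJ"
VERB = r"(?P<process>\w+)/V:MAIN"
OBJECT = r"(?P<object>\w+)/N:OBJ"

# The "macro" combines three component expressions and tolerates
# intervening tokens (for example adjectives) between the components.
MACRO_APO = re.compile(
    rf"{SUBJECT}(?:\s+\S+)*?\s+{VERB}(?:\s+\S+)*?\s+{OBJECT}"
)


def extract_triplet(tagged_sentence):
    """Return (agent, process, object) or None if the pattern does not match."""
    m = MACRO_APO.search(tagged_sentence)
    return (m["agent"], m["process"], m["object"]) if m else None


tagged = "Hydro/N:SUBJ sells/V:MAIN natural/ADJ:MOD gas/N:OBJ"
triplet = extract_triplet(tagged)
```

Only the words matched by the three named groups survive into the triplet; everything else in the sentence is dropped at this level and remains reachable through the contact identifiers.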

Epitomic Triplets

The term is used in order to refer to the fact that the main APO Triplets represent a form of extreme summary of a written work (epitome). The term ‘epitome’ is synonymous with the term ‘synopsis’, the proposed preliminary name of the present invention.

Dublin Core

The Dublin Core is a set of 15 basic information elements designed for use in Web pages to enhance indexing and retrieval. These elements are: title, creator, subject, description, publisher, contributor, date, type, format, resource identifier, source, language, relation, coverage, and rights. Full, up-to-date details are available through the Web page for the Dublin Core metadata element set, http://purl.org/metadata/dublin_core/ (last visited in October 2002).

Theme Frames

The terms ‘subject’, ‘theme’, and ‘topic’ are often defined as near synonyms. A preferred definition of ‘subject’ is Ranganathan's, based on the difference between extension and intension: “Subject is a systematised body of ideas, with its extension and intension falling coherently within the field of interest. It is also comfortably within the intellectual competence and the field of inevitable specialisation of a normal person.” (1987:28).

A related concept is ‘aboutness’, usually defined behaviouristically in terms of the user's opinions about the relationship between what is in the text and how the user perceives this content (content perception relative to a particular person). A ‘Theme Frame’ is a representational unit in the preferred embodiment of the present invention in which each constituent is expressed in terms of rules and guidelines as prescribed in a classification scheme. It is a framework for representing different aspects of the theme within a textual unit such as sentences, zones or structural segments such as chapters, sections, paragraphs, etc. As such, a Theme Frame includes the representations of ‘complex subjects’ with ‘compound subjects’ as constituents, in turn having ‘basic subjects’ as constituents.

Target Word Selection Procedure, Abbreviation TWS

The rounds and levels constructed for each component in the main triplet structure (Agent, Process and Object) will contain index entries at, for instance, a higher abstraction level than the contacts derived from the underlying text. A target word selection procedure is a technique for data abstraction where concepts encoded in domain-specific thesauri are mapped against contacts derived from the underlying text. If a contact returns with the value ‘concept match’ during this procedure (several cycles), a link will be established between the contact and the concept encoded in the thesaurus. The critical issue is not how to establish relations or what type of links or relations to use, but rather which relations will serve a user community.
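The mapping of contacts against thesaurus concepts can be sketched as follows. The thesaurus content, the two matching cycles (exact match, then lower-cased match) and the function name `select_target_words` are assumptions chosen for illustration; the specification only states that several cycles are run.

```python
# Hypothetical domain thesaurus: concept -> set of entry terms.
THESAURUS = {
    "Organisation": {"Hydro", "government"},
    "Resource": {"gas", "oil"},
}


def select_target_words(contacts, thesaurus):
    """Link each contact to the concepts it matches.

    Cycle 1 tries an exact match; cycle 2 retries with lower-cased forms.
    Contacts without a 'concept match' are left unlinked.
    """
    links = {}
    for contact in contacts:
        # Cycle 1: exact string match against the entry terms.
        matches = [c for c, terms in thesaurus.items() if contact in terms]
        if not matches:
            # Cycle 2: case-insensitive match.
            matches = [c for c, terms in thesaurus.items()
                       if contact.lower() in {t.lower() for t in terms}]
        if matches:
            links[contact] = matches
    return links


links = select_target_words(["Hydro", "Gas", "perspective"], THESAURUS)
```

The resulting links supply the higher abstraction level: a pane can display ‘Organisation’ in place of, or alongside, the individual contacts that matched it.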

Word Sense Disambiguation (WSD)

Disambiguation means to establish a single grammatical or semantic interpretation of a word (or word constellation) as it appears in the text. A Constraint Grammar tagger deals with the grammatical ambiguities. Taggers have an error rate (depending on language and text genre), and resolving the meanings of multi-referential words to a full extent will require validation procedures.

Words are character strings, and even if their grammatical word class and grammatical function are determined by a CG-tagger, a character string may have more than one meaning. For instance, a character string can be a homonym where the different referents are distinct. Target Word Selection procedures are techniques used for resolving some of the semantic ambiguities, for instance by using the controlled vocabularies encoded in thesauri limited to specific domains. Both WSD and TWS are related to issues of traversing databases according to specified rules. That is, the grammatically encoded text files, chains, and domain-specific thesauri are traversed to the extent found necessary in order to resolve ambiguities that seriously disturb the system's performance. The degree of grammatical and semantic disambiguation is an issue of costs as opposed to meaningful (coherent) content representations.

Subject Verb Object Structures (SVOS)

The grammatical subject of a sentence can be said to denote what the sentence is about, while its predicate comments on this. The sentence ‘Hydro is an oil company’ has ‘Hydro’ as its grammatical subject and its predicate ‘is an oil company’, which comments on Hydro. The sentence states a fact about Hydro and gives information about Hydro. If the extraction patterns focus on the main sentence grammar components ‘Subject Verb Object’, a collection of sentences about Hydro will result in a structure of representations about Hydro. The grammar patterns governing the term extraction constitute a reduction process in that certain words with certain grammatical functions within certain types of sentences are qualified as input to the extraction procedure. In any kind of information representation there will be an information loss, and the critical issue is therefore to identify semantic categories of special interest within the user community to be served by the search macros (regular expressions) transmitting data to the text sounding board.
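How a collection of sentences about Hydro turns into a structure of representations about Hydro can be sketched by grouping already extracted triplets by their subject. The triplets other than the first are invented for illustration, and `group_by_subject` is a hypothetical helper name.

```python
from collections import defaultdict

# (subject, verb, object) triplets; only the first comes from the text above,
# the others are hypothetical examples.
svo_triplets = [
    ("Hydro", "is", "company"),
    ("Hydro", "sells", "gas"),
    ("Government", "owns", "shares"),
]


def group_by_subject(triplets):
    """Collect the (verb, object) comments made about each subject."""
    about = defaultdict(list)
    for subject, verb, obj in triplets:
        about[subject].append((verb, obj))
    return dict(about)


representations = group_by_subject(svo_triplets)
```

The grouping keeps only subject, verb and object, which makes the information loss mentioned above concrete: modifiers such as ‘oil’ in ‘oil company’ are absent from the representation.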

Agent, Process and Object Structures (APOS)

The SVOS are abstracted into a similar triplet structure for Agent and Object (preferably of transitive actions). The APO Triplets represent an important reduction of all the SVO Triplets encoded in the bottom layer of the multileveled annotation file system. The reduction results from the set of grammar based extraction patterns operating on the bottom layer, for instance by specifying that the only Subjects to be included in the APO Triplets are those that satisfy the criteria ‘Noun and Subject’ and, further, that the noun also exists as encoded in a facet denoting organisations related to Norwegian petroleum affairs.
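The reduction from SVO Triplets to APO Triplets can be sketched as a filter. The `Tagged` record, the tag values and the content of `ORGANISATION_FACET` are assumptions for illustration; the criterion itself (‘Noun and Subject’ plus membership in an organisation facet) is the one named above.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    word: str
    pos: str       # part of speech, e.g. "N", "V", "PRON" (assumed tag names)
    function: str  # syntactic function, e.g. "SUBJ", "VERB", "OBJ"


# Hypothetical facet of organisations related to petroleum affairs.
ORGANISATION_FACET = {"hydro", "government"}


def to_apo(svo, facet):
    """Return an (Agent, Process, Object) triplet if the subject qualifies."""
    subj, verb, obj = svo
    if subj.pos == "N" and subj.function == "SUBJ" and subj.word.lower() in facet:
        return (subj.word, verb.word, obj.word)
    return None


def reduce_to_apo(svos, facet=ORGANISATION_FACET):
    """Keep only the SVO triplets that satisfy the extraction criteria."""
    return [apo for svo in svos if (apo := to_apo(svo, facet)) is not None]


svos = [
    (Tagged("Hydro", "N", "SUBJ"), Tagged("sells", "V", "VERB"), Tagged("gas", "N", "OBJ")),
    (Tagged("It", "PRON", "SUBJ"), Tagged("owns", "V", "VERB"), Tagged("shares", "N", "OBJ")),
    (Tagged("Analyst", "N", "SUBJ"), Tagged("reads", "V", "VERB"), Tagged("report", "N", "OBJ")),
]
apo_triplets = reduce_to_apo(svos)
```

In this sketch the pronoun subject fails the ‘Noun and Subject’ test and the third subject fails the facet test, so one of three SVO Triplets is retained.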

RDF (Resource Description Framework)

RDF is a technology proposed for developing the so-called Semantic Web, in relation with eXtensible Mark-up Language (XML). Basically, this is a simple structure for defining relations between semantic concepts, also encoded in sets of triplets, however not with reference to the grammatical structure of sentences in the text. The triplets of RDF link information about related things in a similar way as concepts are linked to each other in a thesaurus. In the ‘Semantic Web’ terminology, these structures of information are denoted as ‘ontology’. An extension of an RDF contains assertions about facts, for instance ‘London is-a City’, a technique quite popular in the earlier ‘expert systems’ encoded in programming languages such as PROLOG. The new aspect of the ‘Semantic Web’ is that the RDFs are connected to URIs (Uniform Resource Identifiers).

The present invention is based on a quite different ideology formulated with respect to the objectives to be achieved by the proposed system for organising information and theoretical principles that guide the design. This is the reason why it is preferred not to use concepts related to the ‘Semantic Web’ technology aiming at different goals and with different formalisation processes. The theoretical stance underlying the present invention is briefly described in the section ‘The principle of text driven attention structures’.

REFERENCES

-   Aarskog, B. H. (1999): ‘Argumenterende tekst transformert til hypertekst’, Unpublished dissertation, submitted to the University of Bergen, Norway, July 1991. (In Norwegian)
-   Blair, D. C. (1990): Language and representation. Amsterdam: Elsevier.
-   Remer, T. G. (ed.), (1965): Serendipity and the Three Princes of Serendip; From the Peregrinaggio of 1557. Norman, University of Oklahoma Press.
-   Werth, Paul (1999): Text worlds: Representing conceptual space in discourse. Addison-Wesley Longman Ltd.
-   Zipf, G. K. (1945): The meaning-frequency relationships of words. Journal of General Psychology, 33, 251-256.

1. A method for textual exploration and discovery comprising: annotating, with a processor, Subject-Verb-Object Structures (SVOS) in a grammatically encoded electronic text; and wherein said SVOS are used to identify semantic facets termed “Agent”, “Process” and “Object”, i.e. APOS, in a text span, and wherein said semantic facets “Agent”, “Process” and “Object” are provided as index entries in respective window panes on a display unit as contact points to said electronic text.
 2. A method in accordance with claim 1, wherein three separate window panes are provided on the display unit.
 3. A method in accordance with claim 2, wherein said triplets are based on a grammatically founded design aiming at supporting exploration and discovery.
 4. A method in accordance with claim 3, wherein the grammatical design is based on grammatical annotation.
 5. A method in accordance with claim 4, wherein the grammatical annotation is based on part-of-speech tagging (POS-tagging).
 6. A method in accordance with claim 4, wherein the grammatical annotation is based on constraint grammars.
 7. A method in accordance with claim 1, wherein said triplets are dynamically extracted from a grammatically encoded text.
 8. A method in accordance with claim 1, wherein the user after having evaluated a set of contacts can open, and see directly into the text segment from which these contacts are extracted.
 9. A method in accordance with claim 1, wherein the SVOS are organized in triplets.
 10. A method in accordance with claim 1, wherein the APOS are organized in triplets, termed an “APO triplet”.
 11. A method in accordance with claim 1, wherein the user can explore the contacts through various options for filtering and sorting.
 12. A method in accordance with claim 1, wherein the contacts relate to facets organized according to at least one of grammatical, semantic or pragmatic features in the underlying text.
 13. A method in accordance with claim 1, wherein classes of nominal expressions provide semantic roles for subsets of “Agents” and “Objects”.
 14. A method in accordance with claim 1, wherein classes of verbal phrases provide semantic roles for a subset of “Process”.
 15. A method according to claim 1, wherein the contacts are displayed in an interface in the form of windows arranged for instance side-by-side, each window with options for expansion or reduction, and options for displaying the underlying words as they appear in the text.
 16. A method for textual exploration and discovery according to claim 1, wherein; i) a first window pane contains index entries of Agent structures containing noun units, and ii) a second window pane contains index entries of Process structures containing verb units, and wherein this window pane is constrained to verb units that follow the noun units displayed in the text pane, and iii) a third window pane contains index entries of Object structures containing noun units constrained to noun units and verb units displayed in the text pane.
 17. Apparatus for textual exploration and discovery, wherein Subject-Verb-Object Structures (SVOS) are annotated in a grammatically encoded electronic text, and wherein said SVOS are used to identify semantic facets termed “Agent”, “Process” and “Object”, i.e. APOS, in a text span, wherein the system comprises: a processor; memory; and a) an acquisition module for collection of documents, and capable of formatting the documents to at least one common format, b) a segmentation module for the generation of Annotated Text Files (ATF), thus forming the Annotated Text Corpus, and c) a Disambiguation Module for text disambiguation, and d) a display unit for presenting said semantic facets “Agent”, “Process” and “Object” as index entries in respective window panes as contacts to said electronic text.
 18. Apparatus according to claim 17, wherein the acquisition module is capable of administering, indexing and querying large text corpora.
 19. Apparatus according to claim 17, wherein the documents can be annotated with structural information and grammatical information.
 20. Apparatus according to claim 17, wherein the module provides a documental link structure, for instance, a group of peripheral documents are linked to a central document, the central documents can be linked to each other, or peripheral documents associated with a central document may also be linked to another central document.
 21. Apparatus according to claim 17, wherein the module enables recording of various types of information about the texts, such as document source, collection date, person responsible for collecting it, language, copyright status, format information and version information.
 22. Apparatus according to claim 17, wherein the segmentation process includes metadata assignment.
 23. Apparatus according to claim 17, wherein the segmentation process applies the Dublin Core Metadata Element Set.
 24. Apparatus according to claim 17, wherein a multileveled annotation file system is constructed.
 25. Apparatus according to claim 24, wherein the Disambiguation module deals with techniques for converting output from Constraint Grammar taggers (CG-tagger) into an annotation format in compliance with the structure/architecture specified for the Multileveled Annotation File System (MAFS).
 26. Apparatus according to claim 25, wherein extracted subsets of grammatical tags (codes) are combined with a selected set of semantic codes.
 27. Apparatus according to claim 25, wherein special codes describing different linguistic features are assigned to the words in the texts.
 28. Apparatus according to claim 17, wherein the module provides a framework based on triplets in the basic form Subject Verb Object Structures (SVOS).
 29. An apparatus according to claim 17, wherein said display unit provides three separate window panes.
 30. The apparatus according to claim 29, wherein; a first window pane contains index entries of Agent structures containing noun units, and a second window pane contains index entries of Process structures containing verb units, and wherein this window pane is constrained to verb units that follow the noun units displayed in the text pane, and a third window pane contains index entries of Object structures containing noun units constrained to noun units and verb units displayed in the text pane.